A Close Call: The Time Our HBase Cluster Went Down

This article is reposted from the WeChat official account 「Java大数据与数据仓库」, written by 柯同学. To repost it, please contact the Java大数据与数据仓库 official account.

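This is the story of an HBase cluster incident from some time ago. Because of a careless operation the cluster could not start, and it took a whole weekend, with help from an expert at Tencent Cloud, to repair it. It is an experience I will not forget.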

Version information

cdh-6.0.1 hadoop-3.0 hbase-2.0.0

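The problem

I wanted to restart HBase during a quiet period to free some memory, and to change a few YARN settings at the same time. After I stopped it, HBase would not come back up. The error was that the hbase:namespace table "is not online", and the master stayed stuck in initialization. The full error output: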

15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node4,16020,1589648302672, table=real_time_data, region=74cac15d22e99800ad0ace14c9ed74d6
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=8e68891d5826c09974d81ad5d705c3b6
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=75c42d75e2556bf70ff527f2425e8509
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=2eee04869ac2c35984d4d22e6e9f2f31
15:42:08.264 [master/node3:16000] INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl - Call exception, tries=15, retries=15, started=128887 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a. is not online on node1,16020,1596957741742
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a., hostname=node1,16020,1589648239142, seqNum=55
……
15:44:58.229 [qtp1792826268-435] WARN org.eclipse.jetty.servlet.ServletHandler - /master-status
org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
    at org.apache.hadoop.hbase.master.HMaster.isInMaintenanceMode(HMaster.java:2827) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:271) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:389) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:380) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:81) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api-3.1.0.jar:3.1.0]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:112) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:48) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1374) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) [jetty-security-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
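The AssignmentManager warnings in the log above each name a stuck region, its table, and the RegionServer it was being opened on. When there are hundreds of such lines, a small script can summarize them. The following is a minimal sketch rather than an official tool: the function name `stuck_regions` is made up for illustration, and the pattern is written only against the log lines quoted in this article, so other HBase versions may format the message differently.

```python
import re
from collections import defaultdict

# Matches the "STUCK Region-In-Transition" warning quoted in this article.
# \s* keeps the pattern tolerant of logs whose whitespace was lost on paste.
STUCK_RIT = re.compile(
    r"STUCK\s*Region-In-Transition\s*"
    r"rit=(?P<state>[A-Z_]+),\s*"
    r"location=(?P<host>[^,\s]+),(?P<port>\d+),(?P<startcode>\d+),\s*"
    r"table=(?P<table>[^,\s]+),\s*"
    r"region=(?P<region>[0-9a-f]{32})"
)


def stuck_regions(log_text):
    """Group encoded region names stuck in transition by (table, host)."""
    grouped = defaultdict(list)
    for match in STUCK_RIT.finditer(log_text):
        grouped[(match["table"], match["host"])].append(match["region"])
    return dict(grouped)


if __name__ == "__main__":
    import sys

    # Usage: python stuck_regions.py < hbase-master.log
    for (table, host), regions in sorted(stuck_regions(sys.stdin.read()).items()):
        print(f"{table} on {host}: {len(regions)} stuck region(s)")
        for region in regions:
            print(f"  {region}")
```

Grouping by host makes it easy to see whether the stuck regions cluster on one RegionServer, as they do on node3 in the excerpt above.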

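The usual approach

At this point I tried the hbck command to inspect and repair the cluster, only to find that in HBase 2.0.0 hbck's repair options are no longer supported.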

-----------------------------------------------------------------------
NOTE: As of HBase version 2.0, the hbck tool is significantly changed.
In general, all Read-Only options are supported and can be be used
safely. Most -fix/-repair options are NOT supported. Please see usage
below for details on which options are not supported.
-----------------------------------------------------------------------
[... output omitted ...]
NOTE: Following options are NOT supported as of HBase version 2.0+.

  UNSUPPORTED Metadata Repair options: (expert features, use with caution!)
   -fix              Try to fix region assignments.  This is for backwards compatiblity
   -fixAssignments   Try to fix region assignments.  Replaces the old -fix
   -fixMeta          Try to fix meta problems.  This assumes HDFS region info is good.
   -fixHdfsHoles     Try to fix region holes in hdfs.
   -fixHdfsOrphans   Try to fix region dirs with no .regioninfo file in hdfs
   -fixTableOrphans  Try to fix table dirs with no .tableinfo file in hdfs (online mode only)
   -fixHdfsOverlaps  Try to fix region overlaps in hdfs.
   -maxMerge <n>     When fixing region overlaps, allow at most <n> regions to merge. (n=5 by default)
   -sidelineBigOverlaps  When fixing region overlaps, allow to sideline big overlaps
   -maxOverlapsToSideline <n>  When fixing region overlaps, allow at most <n> regions to sideline per group. (n=2 by default)
   -fixSplitParents  Try to force offline split parents to be online.
   -removeParents    Try to offline and sideline lingering parents and keep daughter regions.
   -fixEmptyMetaCells  Try to fix hbase:meta entries not referencing any region (empty REGIONINFO_QUALIFIER rows)

  UNSUPPORTED Metadata Repair shortcuts
   -repair           Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps -fixReferenceFiles -fixHFileLinks
   -repairHoles      Shortcut for -fixAssignments -fixMeta -fixHdfsHoles

Searching further, I came across HBCK2 (official repository: https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2). I thought I had found a lifeline, but then:

===================================================================
HBCK2 Overview

HBCK2 is currently a simple tool that does one thing at a time only.

In hbase-2.x, the Master is the final arbiter of all state, so a general principal for most HBCK2 commands is that it asks the Master to effect all repair. This means a Master must be up before you can run HBCK2 commands.

The HBCK2 implementation approach is to make use of an HbckService hosted on the Master. The Service publishes a few methods for the HBCK2 tool to pull on. Therefore, for HBCK2 commands relying on Master's HbckService facade, first thing HBCK2 does is poke the cluster to ensure the service is available. This will fail if the remote Server does not publish the Service or if the HbckService is lacking the requested method. For the latter case, if you can, update your cluster to obtain more fix facility.

HBCK2 versions should be able to work across multiple hbase-2 releases. It will fail with a complaint if it is unable to run. There is no HbckService in versions of hbase before 2.0.3 and 2.1.1. HBCK2 will not work against these versions.

Next we look first at how you 'find' issues in your running cluster followed by a section on how you 'fix' found problems.
===================================================================

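I give up. HBCK2 does not work on HBase 2.0.0 through 2.0.2, nor on 2.1.0. On those versions you can use neither the hbck repair options nor HBCK2, so there is a gap in the tooling.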
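The version rule from the HBCK2 README ("no HbckService in versions of hbase before 2.0.3 and 2.1.1") can be written down as a small check. This is a sketch of that one sentence only: the function name `hbck2_supported` is hypothetical, it covers 2.x version strings only, and vendor builds (such as the CDH release used here) may backport or omit the service, so it is no substitute for actually running HBCK2 against the Master.

```python
import re

# First patch release on each 2.x minor line that ships HbckService, taken
# from the README sentence: "no HbckService ... before 2.0.3 and 2.1.1".
FIRST_PATCH_WITH_HBCK_SERVICE = {0: 3, 1: 1}


def hbck2_supported(version):
    """Return True if an upstream HBase 2.x release should have HbckService."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if major != 2:
        raise ValueError("this sketch only covers HBase 2.x")
    # Minor lines 2.2 and later are newer than both thresholds.
    return patch >= FIRST_PATCH_WITH_HBCK_SERVICE.get(minor, 0)


if __name__ == "__main__":
    for candidate in ("2.0.0", "2.0.3", "2.1.0", "2.1.1"):
        print(candidate, hbck2_supported(candidate))
```

For the cluster in this article (hbase-2.0.0) the check returns False, which matches what happened: neither hbck's repair options nor HBCK2 were available.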

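The solution

1. Repair the master so the cluster can start

The master could not finish initializing and the cluster could not start, because the metadata table hbase:meta was damaged and the hbase:namespace table was not online. The first job was to bring hbase:namespace online and get the cluster started, since none of the later repair work is possible otherwise. The user tables could be repaired after that. (By this point I was close to despair and ready to rebuild the cluster from scratch.)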

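Reading the HBase source, I found that the system table hbase:namespace is re-created at startup if it does not exist. The relevant code is in TableNamespaceManager.java (shown as a screenshot in the original post).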

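The plan: back up the HDFS data of the hbase:namespace table, delete the table, let HBase re-create it on startup, and then bulk load the backed-up data into the new hbase:namespace table.

Delete the hbase:namespace row from hbase:meta, and mv the table's HDFS directory to a temporary backup directory. Together these two steps amount to deleting the hbase:namespace table.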

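Then restart the HBase cluster. The namespace table is re-created and the cluster finally comes up. At this point hbase:namespace contains only the default namespace, so we use the bulkload command to move the HFiles from the temporary directory into the hbase:namespace table, which restores the namespace table.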
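The article does not show the exact commands used, so the following is only a sketch of what the steps in this section might look like. It prints the commands instead of running them, because they are destructive. Several things here are assumptions and not from the original: the HBase root directory `/hbase`, the backup path `/tmp/hbase-namespace-backup`, the class and function names, and the use of `org.apache.hadoop.hbase.tool.LoadIncrementalHFiles` as the bulk load entry point (the class name as I understand it for HBase 2.0; check your release). The region name is the one that appears in the error log.

```python
from dataclasses import dataclass


def encoded_region_name(region_name):
    """Return the encoded name from '<table>,<start key>,<id>.<encoded>.'."""
    return region_name.rstrip(".").rsplit(".", 1)[1]


@dataclass
class NamespaceRebuildPlan:
    region_name: str                                  # row key in hbase:meta
    hbase_root: str = "/hbase"                        # assumption: hbase.rootdir
    backup_dir: str = "/tmp/hbase-namespace-backup"   # hypothetical location

    def before_restart(self):
        """Back up the table directory, then drop its row from hbase:meta."""
        table_dir = f"{self.hbase_root}/data/hbase/namespace"
        return [
            ("bash", f"hdfs dfs -mkdir -p {self.backup_dir}"),
            ("bash", f"hdfs dfs -mv {table_dir} {self.backup_dir}/"),
            ("hbase shell", f"deleteall 'hbase:meta', '{self.region_name}'"),
        ]

    def after_restart(self):
        """Bulk load the backed-up HFiles into the re-created table."""
        # The directory passed to the loader must contain one subdirectory
        # per column family; other entries may need to be moved aside first.
        region_dir = (
            f"{self.backup_dir}/namespace/{encoded_region_name(self.region_name)}"
        )
        return [
            ("bash", "hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles "
                     f"{region_dir} hbase:namespace"),
        ]


if __name__ == "__main__":
    plan = NamespaceRebuildPlan(
        region_name="hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a."
    )
    for where, command in plan.before_restart():
        print(f"[{where}] {command}")
    print("-- restart HBase and wait for hbase:namespace to be re-created --")
    for where, command in plan.after_restart():
        print(f"[{where}] {command}")
```

Keeping the plan as data makes it easy to review each command against your own cluster layout before typing anything.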

2. Repair the HBase tables

After all that effort the cluster was up, but the web UI showed that every table was empty: the HDFS data files for each region could not be found.

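The hbase:meta table stores region assignment and related information for every table. The abnormal shutdown appears to have damaged hbase:meta, which would explain why the region assigned to hbase:namespace could not be found.

The plan: repair hbase:meta from the .regioninfo files. Reference blog post: https://blog.csdn.net/xyzkenan/article/details/103476160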

Tool: https://github.com/DarkPhoenixs/hbase-meta-repair
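Before running a repair tool it helps to know which regions exist on HDFS but are missing from hbase:meta. The sketch below is not how hbase-meta-repair works internally. It only illustrates the idea that each region directory carries a .regioninfo file, and it relies on the HBase 2 layout `<root>/data/<namespace>/<table>/<encoded region>/.regioninfo`. The function name and the two inputs are hypothetical: a list of paths (for example lines from `hdfs dfs -ls -R /hbase/data`) and the set of encoded region names currently present in hbase:meta, which you would have to collect separately.

```python
import re

# <root>/data/<namespace>/<table>/<encoded region>/.regioninfo
REGIONINFO_PATH = re.compile(
    r"/data/(?P<namespace>[^/\s]+)/(?P<table>[^/\s]+)/"
    r"(?P<region>[0-9a-f]{32})/\.regioninfo$"
)


def regions_missing_from_meta(hdfs_paths, encoded_names_in_meta):
    """List (table, encoded region) pairs found on HDFS but not in meta."""
    known = set(encoded_names_in_meta)
    missing = []
    for line in hdfs_paths:
        # search() lets the path be the last field of an `ls -R` output line.
        match = REGIONINFO_PATH.search(line.strip())
        if match is None or match["region"] in known:
            continue
        namespace, table = match["namespace"], match["table"]
        # Tables in the default namespace are referred to without a prefix.
        full_name = table if namespace == "default" else f"{namespace}:{table}"
        missing.append((full_name, match["region"]))
    return missing


if __name__ == "__main__":
    import sys

    # Usage: hdfs dfs -ls -R /hbase/data | python missing_regions.py
    # No meta contents are supplied here, so every region is reported.
    for table, region in regions_missing_from_meta(sys.stdin, set()):
        print(f"{table}\t{region}")
```

The resulting list is a checklist: after the repair tool has rewritten hbase:meta, the same comparison should come back empty.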

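Summary

It was finally fixed; a false alarm in the end, and a reminder to cherish the good life. In this incident the HBase cluster could not start, with two symptoms: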

namespace region is not online; Master is initializing;

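The approach to fixing it was:

First, delete the hbase:namespace table and let HBase re-create it on startup, which resolves the failure to start. Then repair the hbase:meta table from the .regioninfo files.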
