# Control Script
# 1. Control Script
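Ignite provides a command-line script, `./control.sh`, that can monitor and control the state of a cluster. It is located in the `/bin` folder of the Ignite home directory.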
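# 1.1. Activation, Deactivation, and Topology Management
First of all, `./control.sh` is used to activate or deactivate the cluster and to manage the nodes of the baseline topology. See the relevant sections for details.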
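# 1.2. Cache State Monitoring
`./control.sh` provides several commands that start with `--cache list` for cache monitoring. They show the list of deployed caches with their affinity parameters and the distribution of those caches across cache groups. Another command shows the existing atomic sequences.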
# The regular expressions are quoted so that the shell does not expand them as file-name patterns.
# Displays the list of all caches with their affinity parameters.
./control.sh --cache list '.*'
# Displays the list of caches whose names start with "account-", with their affinity parameters.
./control.sh --cache list 'account-.*'
# Displays the cache group distribution of all caches.
./control.sh --cache list '.*' groups
# Displays the cache group distribution of caches whose names start with "account-".
./control.sh --cache list 'account-.*' groups
# Displays info about all atomic sequences.
./control.sh --cache list '.*' seq
# Displays info about atomic sequences whose names start with "counter-".
./control.sh --cache list 'counter-.*' seq
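Because the cache-name argument is a regular expression, an unquoted value such as `.*` can be expanded by the shell before the script ever sees it. The following is a minimal sketch of a wrapper that always passes the pattern quoted. The function name `list_caches` and the variable `CONTROL_SH` are hypothetical and are not part of Ignite.

```shell
# Hypothetical helper, not part of Ignite.
# CONTROL_SH is an assumed variable that holds the path to the control script.
CONTROL_SH="${CONTROL_SH:-./control.sh}"

# list_caches REGEX [groups|seq]
# Quoting "$1" hands the regular expression to the script unchanged,
# instead of letting the shell expand it against file names.
list_caches() {
    if [ "$#" -lt 1 ]; then
        echo "usage: list_caches REGEX [groups|seq]" >&2
        return 2
    fi
    if [ -n "${2:-}" ]; then
        "$CONTROL_SH" --cache list "$1" "$2"
    else
        "$CONTROL_SH" --cache list "$1"
    fi
}
```

For example, `list_caches 'account-.*' groups` runs the same command as the fourth example above.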
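# 1.3. Transaction Contention Detection
The `contention` command shows when multiple transactions are competing for a lock on the same key. It is useful when you run into long-running or hanging transactions. For example: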
# Reports all keys that are points of contention for at least 5 transactions on all cluster nodes.
./control.sh --cache contention 5
# Reports all keys that are points of contention for at least 5 transactions on a specific server node.
./control.sh --cache contention 5 f2ea-5f56-11e8-9c2d-fa7a
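If there are highly contended keys, the tool prints extensive information, including the keys, the transactions, and the nodes where the contention occurs. For example: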
[node=TcpDiscoveryNode [id=d9620450-eefa-4ab6-a821-644098f00001, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47501], discPort=47501, order=2, intOrder=2, lastExchangeTime=1527169443913, loc=false, ver=2.5.0#20180518-sha1:02c9b2de, isClient=false]]
// No contention on node d9620450-eefa-4ab6-a821-644098f00001.
[node=TcpDiscoveryNode [id=03379796-df31-4dbd-80e5-09cef5000000, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1527169443913, loc=false, ver=2.5.0#20180518-sha1:02c9b2de, isClient=false]]
TxEntry [cacheId=1544803905, key=KeyCacheObjectImpl [part=0, val=0, hasValBytes=false], queue=10, op=CREATE, val=UserCacheObjectImpl [val=0, hasValBytes=false], tx=GridNearTxLocal[xid=e9754629361-00000000-0843-9f61-0000-000000000001, xidVersion=GridCacheVersion [topVer=138649441, order=1527169439646, nodeOrder=1], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=ACTIVE, invalidate=false, rollbackOnly=false, nodeId=03379796-df31-4dbd-80e5-09cef5000000, timeout=0, duration=1247], other=[]]
TxEntry [cacheId=1544803905, key=KeyCacheObjectImpl [part=0, val=0, hasValBytes=false], queue=10, op=READ, val=null, tx=GridNearTxLocal[xid=8a754629361-00000000-0843-9f61-0000-000000000001, xidVersion=GridCacheVersion [topVer=138649441, order=1527169439656, nodeOrder=1], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=ACTIVE, invalidate=false, rollbackOnly=false, nodeId=03379796-df31-4dbd-80e5-09cef5000000, timeout=0, duration=1175], other=[]]
TxEntry [cacheId=1544803905, key=KeyCacheObjectImpl [part=0, val=0, hasValBytes=false], queue=10, op=READ, val=null, tx=GridNearTxLocal[xid=6a754629361-00000000-0843-9f61-0000-000000000001, xidVersion=GridCacheVersion [topVer=138649441, order=1527169439654, nodeOrder=1], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=ACTIVE, invalidate=false, rollbackOnly=false, nodeId=03379796-df31-4dbd-80e5-09cef5000000, timeout=0, duration=1175], other=[]]
TxEntry [cacheId=1544803905, key=KeyCacheObjectImpl [part=0, val=0, hasValBytes=false], queue=10, op=READ, val=null, tx=GridNearTxLocal[xid=7a754629361-00000000-0843-9f61-0000-000000000001, xidVersion=GridCacheVersion [topVer=138649441, order=1527169439655, nodeOrder=1], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=ACTIVE, invalidate=false, rollbackOnly=false, nodeId=03379796-df31-4dbd-80e5-09cef5000000, timeout=0, duration=1175], other=[]]
TxEntry [cacheId=1544803905, key=KeyCacheObjectImpl [part=0, val=0, hasValBytes=false], queue=10, op=READ, val=null, tx=GridNearTxLocal[xid=4a754629361-00000000-0843-9f61-0000-000000000001, xidVersion=GridCacheVersion [topVer=138649441, order=1527169439652, nodeOrder=1], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=ACTIVE, invalidate=false, rollbackOnly=false, nodeId=03379796-df31-4dbd-80e5-09cef5000000, timeout=0, duration=1175], other=[]]
// Node 03379796-df31-4dbd-80e5-09cef5000000 is place for contention on key KeyCacheObjectImpl [part=0, val=0, hasValBytes=false].
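Output like this can be long, so a short summary helps when scanning it. The sketch below assumes the line layout shown in the sample above (lines that start with `TxEntry` and contain an `op=` field) and counts the entries per operation type. The function name `summarize_contention` is hypothetical.

```shell
# summarize_contention: reads `--cache contention` output on stdin and prints
# one "<operation>=<count>" line per operation type found in TxEntry lines.
# Assumes the "TxEntry [..., op=XXX, ...]" layout shown in the sample output.
summarize_contention() {
    awk '
        /^TxEntry / {
            # Locate the ", op=NAME" field and strip the 5-character prefix.
            if (match($0, /, op=[A-Z_]+/)) {
                op = substr($0, RSTART + 5, RLENGTH - 5)
                count[op]++
            }
        }
        END {
            for (op in count) print op "=" count[op]
        }
    ' | LC_ALL=C sort
}
```

A possible invocation is `./control.sh --cache contention 5 | summarize_contention`.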
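# 1.4. Consistency Check Commands
The script also includes a set of commands that verify internal data consistency.
First, these commands can be used for debugging and troubleshooting, especially on nodes that are under active development.
Second, if you suspect that a query (such as an SQL query) returned an incomplete or incorrect result set, the commands can verify whether the data is actually inconsistent.
Finally, the consistency check commands can be used as part of regular cluster health monitoring.
The following sections describe some usage scenarios in more detail:
Partition Checksum Verification
Even if the update counters and sizes are equal on the primary and backup nodes, the primary and backup copies may still diverge because of a critical failure. The `idle_verify` command of `./control.sh` computes and compares partition hashes across the whole cluster and reports any differences. You can specify a list of caches to verify. For example: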
# Verifies that the partitions of all caches actually contain the same data.
./control.sh --cache idle_verify
# Verifies that the partitions of the specified caches actually contain the same data.
./control.sh --cache idle_verify cache1,cache2,cache3
If any partitions diverge, the list of conflicting partitions is printed, as follows:
idle_verify check has finished, found 2 conflict partitions.
Conflict partition: PartitionKey [grpId=1544803905, grpName=default, partId=5]
Partition instances: [PartitionHashRecord [isPrimary=true, partHash=97506054, updateCntr=3, size=3, consistentId=bltTest1], PartitionHashRecord [isPrimary=false, partHash=65957380, updateCntr=3, size=2, consistentId=bltTest0]]
Conflict partition: PartitionKey [grpId=1544803905, grpName=default, partId=6]
Partition instances: [PartitionHashRecord [isPrimary=true, partHash=97595430, updateCntr=3, size=3, consistentId=bltTest1], PartitionHashRecord [isPrimary=false, partHash=66016964, updateCntr=3, size=2, consistentId=bltTest0]]
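The cluster should be idle during the idle_verify check
All updates must be stopped while `idle_verify` computes the hashes; otherwise it may report false positives. It is not possible to compare large datasets in a distributed system while they are being updated continuously.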
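When `idle_verify` is used as part of a health check, the conflict count can be extracted from the summary line. The sketch below relies only on the `found N conflict partitions` line shown in the sample output. Treating the absence of that line as "no conflicts" is an assumption, since the success output is not shown here. The function names `conflict_count` and `verify_ok` are hypothetical.

```shell
# conflict_count: reads idle_verify output on stdin and prints the number of
# conflicting partitions from the summary line.
# Assumption: if no "found N conflict partitions" line is present, print 0.
conflict_count() {
    awk '
        match($0, /found [0-9]+ conflict partitions/) {
            # The matched text is "found N conflict partitions"; N is word 2.
            split(substr($0, RSTART, RLENGTH), words, " ")
            n = words[2]
        }
        END { print n + 0 }
    '
}

# verify_ok: succeeds only when the output on stdin reports no conflicts.
verify_ok() {
    [ "$(conflict_count)" -eq 0 ]
}
```

One way to use it is `./control.sh --cache idle_verify cache1,cache2 | verify_ok || echo "partition conflicts detected" >&2`. Since a missing summary line counts as success in this sketch, it is advisable to also check that the script itself ran successfully.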
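SQL Index Consistency Verification
The `validate_indexes` command validates the indexes of the given caches locally on every cluster node.
The validation process checks the following:
- All key-value entries referenced by the primary index must be reachable from the secondary SQL indexes, if there are any;
- All key-value entries referenced by the primary index must be reachable; a reference in the primary index must not point to a nonexistent entry;
- All key-value entries referenced by the secondary SQL indexes must be reachable from the primary index as well.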
# Checks indexes of all caches on all cluster nodes.
./control.sh --cache validate_indexes
# Checks indexes of specific caches on all cluster nodes.
./control.sh --cache validate_indexes cache1,cache2
# Checks indexes of specific caches on node with given node ID.
./control.sh --cache validate_indexes cache1,cache2 f2ea-5f56-11e8-9c2d-fa7a
If an index references a nonexistent entry (or an entry is not indexed), errors are printed, as follows:
PartitionKey [grpId=-528791027, grpName=persons-cache-vi, partId=0] ValidateIndexesPartitionResult [updateCntr=313, size=313, isPrimary=true, consistentId=bltTest0]
IndexValidationIssue [key=0, cacheName=persons-cache-vi, idxName=_key_PK], class org.apache.ignite.IgniteCheckedException: Key is present in CacheDataTree, but can't be found in SQL index.
IndexValidationIssue [key=0, cacheName=persons-cache-vi, idxName=PERSON_ORGID_ASC_IDX], class org.apache.ignite.IgniteCheckedException: Key is present in CacheDataTree, but can't be found in SQL index.
validate_indexes has finished with errors (listed above).
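The cluster should be idle during the validate_indexes check
As with `idle_verify`, the index validation tool works correctly only when all updates are stopped. Otherwise, a race between the checker thread and the threads that update entries or indexes may lead to false-positive error reports.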
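To see which indexes are affected when the report is long, the issue lines can be grouped by cache and index name. The sketch below assumes the `IndexValidationIssue [key=..., cacheName=..., idxName=...]` layout shown in the sample output. The function name `issues_per_index` is hypothetical.

```shell
# issues_per_index: reads validate_indexes output on stdin and prints
# "<cacheName>/<idxName> <count>" for every index with reported issues.
# Assumes the IndexValidationIssue line layout shown in the sample output.
issues_per_index() {
    sed -n 's|^IndexValidationIssue \[key=.*, cacheName=\([^,]*\), idxName=\([^]]*\)\].*|\1/\2|p' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

A possible invocation is `./control.sh --cache validate_indexes cache1,cache2 | issues_per_index`.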