Best Practices for Monitoring Project Resource Quotas with CloudLens for SLS - Alibaba Cloud (云淘科技)

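This article describes how to use the global error logs and monitoring metrics in CloudLens for SLS to monitor Project resource quotas, covering both usage-level monitoring and quota-exceeded monitoring.

Background

Alibaba Cloud Lens builds unified observability for cloud products on top of SLS. It supports one-click collection of instance logs (important logs, detailed logs, and job run logs) and global logs (audit logs, billing logs, error logs, and monitoring metrics).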

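Log category | Subcategory | Monitoring scenarios
Instance logs | Detailed logs (paid) | Access traffic monitoring; access exception monitoring
Instance logs | Important logs (free) | Consumer group monitoring; Logtail collection monitoring
Instance logs | Job run logs (free) | Data transformation (new version) monitoring; Scheduled SQL job monitoring
Global logs | Audit logs (free) | Resource operation monitoring
Global logs | Error logs (free) | Quota-exceeded monitoring; access exception monitoring; operation exception monitoring
Global logs | Monitoring metrics (free) | Access traffic monitoring; access exception monitoring; resource quota usage-level monitoring
Global logs | Billing logs (free) | Resource usage tracking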

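For a description of each log type, see the CloudLens log index table: https://help.aliyun.com/document_detail/456901.html?spm=a2c4g.456864.0.0.e979723c8We7zA

Use cases

This article describes how to use the global error logs and monitoring metrics in CloudLens for SLS for usage-level monitoring and quota-exceeded monitoring of Project resource quotas, and how to submit a request to raise a resource quota.

Prerequisites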

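  • CloudLens for SLS is enabled, together with global error logs and monitoring metrics.

  • Global monitoring logs must be stored in the same Project.
  • To build real-time usage-level monitoring of resource quotas, the global monitoring logs (error logs and monitoring metrics) must be stored in the same Project. To keep monitoring logs from consuming the quota of a business Project, pick a dedicated target Project in a fixed region, for example in the Hangzhou region: log-service-{user ID}-cn-hangzhou.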

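CloudLens for SLS quota monitoring dashboard

Resource quota warning overview

The report provides an overview of resource quota warnings (usage level above 80%) and the distribution of quota-exceeded events.

Real-time usage levels of key Project resource quotas

Shows the real-time usage levels of some basic Project resource quotas and of the data read/write resource quotas.

Project resource quota-exceeded details

Monitoring practice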

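  • Quota monitoring items by category:

    Category | Monitoring item | Description | Depends on
    Real-time usage-level monitoring | Basic resource quota usage-level monitoring | Whether the usage level of the LogStore count, machine group count, or Logtail collection configuration count in a Project exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Real-time usage-level monitoring | Data read/write quota usage-level monitoring | Whether the usage level of Project write traffic or Project write count exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Quota-exceeded monitoring | Resource quota-exceeded count monitoring | Number of times basic quotas and data read/write quotas are exceeded | Logstore internal-error_log
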
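  • Advanced monitoring items in detail:

    Category | Scenario | Monitoring item | Description | Depends on
    Basic resource quotas | LogStore | Real-time usage-level monitoring | Whether the LogStore count in a Project exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Basic resource quotas | LogStore | Quota-exceeded monitoring | Number of times the LogStore count quota of a Project is exceeded | Logstore internal-error_log
    Basic resource quotas | Machine group | Usage-level monitoring | Whether the machine group count in a Project exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Basic resource quotas | Machine group | Quota-exceeded monitoring | Number of times the machine group count quota of a Project is exceeded | Logstore internal-error_log
    Basic resource quotas | Logtail collection configuration | Usage-level monitoring | Whether the Logtail collection configuration count in a Project exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Basic resource quotas | Logtail collection configuration | Quota-exceeded monitoring | Number of times the Logtail collection configuration count quota of a Project is exceeded | Logstore internal-error_log
    Data read/write resource quotas | Project write traffic | Usage-level monitoring | Whether Project write traffic exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Data read/write resource quotas | Project write traffic | Quota-exceeded monitoring | Number of times the Project write traffic quota is exceeded | Logstore internal-error_log
    Data read/write resource quotas | Project write count | Usage-level monitoring | Whether the Project write count exceeds the expected threshold percentage | Metricstore internal-monitor-metric
    Data read/write resource quotas | Project write count | Quota-exceeded monitoring | Number of times the Project write count quota is exceeded | Logstore internal-error_log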

Basic monitoring

Basic resource quota usage-level monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count, machine group count, or Logtail collection configuration count has reached the alert threshold.

Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows. An alert can compare at most 1,000 rows of the result against the trigger condition, so it is best to filter by usage level inside the alert SQL first, as the condition logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80 does here.

Query SQL:

* | select Project, region, logstore_ratio, machine_group_ratio, logtail_config_ratio from
(SELECT A.id as Project , A.region as region,
round(COALESCE(SUM(B.count_logstore), 0)/cast(json_extract(A.quota, '$.logstore') as double) * 100, 3) as logstore_ratio, cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore,
round(COALESCE(SUM(C.count_machine_group), 0)/cast(json_extract(A.quota, '$.machine_group') as double) * 100, 3) as machine_group_ratio, cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group,
round(COALESCE(SUM(D.count_logtail_config), 0)/cast(json_extract(A.quota, '$.config') as double) * 100, 3) as logtail_config_ratio, cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logstore
FROM "resource.sls.cmdb.logstore" as B
GROUP BY project
) AS B ON A.id = B.project
LEFT JOIN (
SELECT project, COUNT(*) AS count_machine_group
FROM "resource.sls.cmdb.machine_group" as C
GROUP BY project
) AS C ON A.id = C.project
LEFT JOIN (
SELECT project, COUNT(*) AS count_logtail_config
FROM "resource.sls.cmdb.logtail_config" as D
GROUP BY project
) AS D ON A.id = D.project
group by A.id, A.quota, A.region)
where quota_logstore is not null and quota_machine_group is not null and quota_logtail_config is not null and (logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80) limit 10000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

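  • Severity is Critical when the usage level of the LogStore count, machine group count, or Logtail collection configuration count of any Project exceeds 90% of its quota
  • Severity is Medium when the usage level of the LogStore count, machine group count, or Logtail collection configuration count of any Project exceeds 80% of its quota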

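The ratio and severity rules above can be sketched outside SLS. The Python below is a minimal illustration, not the alerting implementation: the function names, the sample counts, and the strict "greater than" comparison are assumptions made for the example.

```python
from typing import Iterable, Optional


def quota_ratio(used: float, quota: Optional[float]) -> Optional[float]:
    """Usage as a percentage of the quota, rounded to 3 decimals like the SQL."""
    # The query skips Projects whose quota is null; the <= 0 guard is an
    # extra assumption to avoid dividing by zero.
    if quota is None or quota <= 0:
        return None
    return round(used / quota * 100, 3)


def severity(ratios: Iterable[Optional[float]],
             critical: float = 90.0, medium: float = 80.0) -> Optional[str]:
    """Return the level for the highest known ratio, or None if below both."""
    known = [r for r in ratios if r is not None]
    if not known:
        return None
    worst = max(known)
    if worst > critical:
        return "critical"
    if worst > medium:
        return "medium"
    return None


# Hypothetical Project: 170 of 200 LogStores, 10 of 200 machine groups,
# 95 of 100 Logtail collection configurations.
ratios = [quota_ratio(170, 200), quota_ratio(10, 200), quota_ratio(95, 100)]
print(ratios, severity(ratios))
```

Taking the maximum of the three ratios matches the "any one of" wording of the trigger conditions: a single resource above the threshold is enough to raise the alert.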

Data read/write quota usage-level monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of Project write traffic or Project write count has reached the alert threshold.

Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows. An alert can compare at most 1,000 rows of the result against the trigger condition, so it is best to filter by write traffic and write count inside the alert SQL first, as the condition where inflow_ratio > 80 or write_cnt_ratio > 80 does here.

Query SQL:

(*)| select Project, region, inflow_ratio, write_cnt_ratio from
(SELECT cmdb.id as Project, cmdb.region as region,
round(COALESCE(M.name1,0)/round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) * 100, 3) as inflow_ratio,
round(COALESCE(M.name2,0)/cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as double) * 100, 3) as write_cnt_ratio
from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, round(MAX(name1)/1000000000, 3) as name1, MAX(name2) as name2 from
(SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project,
sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1,
sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name2
FROM "internal-monitor-metric.prom" where __name__ in ('logstore_origin_inflow_bytes','logstore_write_count') and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project,time )
group by project) AS M ON cmdb.id = M.project)
where inflow_ratio > 80 or write_cnt_ratio > 80 limit 10000

2. Alert configuration: set the query time range to the last 5 minutes (relative), and configure the trigger conditions and the alert policy according to your business scenario:

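  • Severity is Critical when the usage level of the write traffic or write count of any Project exceeds 90% of its quota
  • Severity is Medium when the usage level of the write traffic or write count of any Project exceeds 80% of its quota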

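The arithmetic in the write-traffic query can be followed step by step in a small Python sketch. It assumes, as the query's grouping suggests, that samples sharing a timestamp are summed per Project and that the largest sum in the window is compared with the per-minute quota. All function names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

BYTES_PER_GB = 1_000_000_000  # the query divides byte values by 1000000000


def peak_per_timestamp(samples: Iterable[Tuple[int, float]]) -> float:
    """Sum the samples that share a timestamp, then take the largest sum."""
    totals = defaultdict(float)
    for timestamp, value in samples:
        totals[timestamp] += value
    return max(totals.values(), default=0.0)


def inflow_ratio(peak_bytes: float, quota_bytes_per_min: float) -> Optional[float]:
    """Write traffic as a percentage of quota, both rounded to GB first."""
    peak_gb = round(peak_bytes / BYTES_PER_GB, 3)
    quota_gb = round(quota_bytes_per_min / BYTES_PER_GB, 3)
    if quota_gb == 0:
        return None  # assumption: treat a zero or tiny quota as "no ratio"
    return round(peak_gb / quota_gb * 100, 3)


def write_count_ratio(peak_count: float, quota_count_per_min: float) -> Optional[float]:
    """Write count as a percentage of the per-minute write count quota."""
    if not quota_count_per_min:
        return None
    return round(peak_count / quota_count_per_min * 100, 3)


# Hypothetical samples for one Project: (timestamp, bytes written by one LogStore).
samples = [(60, 3.0e9), (60, 5.5e9), (120, 4.0e9)]
peak = peak_per_timestamp(samples)
print(peak, inflow_ratio(peak, 10e9), write_count_ratio(270_000, 300_000))
```

Because both sides are rounded to three decimals of a GB before dividing, very small quotas or traffic values lose precision; the sketch keeps that behavior so that it stays close to the query.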

Resource quota-exceeded count monitoring

1. Confirm the alert SQL: every 15 minutes, check whether any quota has been exceeded.

Query SQL:

((* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed or ErrorCode: ShardWriteQuotaExceed or ErrorCode: ShardReadQuotaExceed)))| SELECT Project,
CASE
WHEN ErrorMsg like '%Project write quota exceed: inflow%' then 'Project write traffic exceeded'
WHEN ErrorMsg like '%Project write quota exceed: qps%' then 'Project write count exceeded'
WHEN ErrorMsg like '%dashboard quota exceed%' then 'Dashboard quota exceeded'
WHEN ErrorMsg like '%config count%' then 'Logtail collection configuration quota exceeded'
WHEN ErrorMsg like '%machine group count%' then 'Machine group quota exceeded'
WHEN ErrorMsg like '%Alert count %' then 'Alert quota exceeded'
WHEN ErrorMsg like '%logstore count %' then 'LogStore quota exceeded'
WHEN ErrorMsg like '%shard count%' then 'Shard quota exceeded'
WHEN ErrorMsg like '%shard write bytes%' then 'Shard write quota exceeded'
WHEN ErrorMsg like '%shard write quota%' then 'Shard write quota exceeded'
WHEN ErrorMsg like '%user can only run%' then 'SQL analysis concurrency exceeded'
ELSE ErrorMsg
END AS ErrorMsg,
COUNT(1) AS count GROUP BY Project, ErrorMsg Limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when any quota has been exceeded 10 times
  • Severity is Medium when any quota has been exceeded once

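The CASE expression in the query is a first-match substring classifier, which can be sketched in Python as follows. This is an illustration only: the labels are English renderings, the sample error messages are invented, grouping by the label is a simplification of the query's GROUP BY, and treating the thresholds as "at least 10" and "at least 1" is an assumption about the trigger conditions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# (substring, label) pairs, checked in order like the WHEN branches of the query.
PATTERNS = [
    ("Project write quota exceed: inflow", "Project write traffic exceeded"),
    ("Project write quota exceed: qps", "Project write count exceeded"),
    ("dashboard quota exceed", "Dashboard quota exceeded"),
    ("config count", "Logtail collection configuration quota exceeded"),
    ("machine group count", "Machine group quota exceeded"),
    ("Alert count ", "Alert quota exceeded"),
    ("logstore count ", "LogStore quota exceeded"),
    ("shard count", "Shard quota exceeded"),
    ("shard write bytes", "Shard write quota exceeded"),
    ("shard write quota", "Shard write quota exceeded"),
    ("user can only run", "SQL analysis concurrency exceeded"),
]


def classify(error_msg: str) -> str:
    """Return the label of the first matching pattern, else the raw message."""
    for needle, label in PATTERNS:
        if needle in error_msg:  # case-sensitive, like SQL LIKE '%needle%'
            return label
    return error_msg


def count_exceeded(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count quota-exceeded errors per (Project, label)."""
    return Counter((project, classify(msg)) for project, msg in records)


def severity(count: int) -> Optional[str]:
    """Assumed thresholds: 10 or more is critical, 1 or more is medium."""
    if count >= 10:
        return "critical"
    if count >= 1:
        return "medium"
    return None


# Invented (Project, ErrorMsg) records for illustration.
records = [("demo-project", "Project write quota exceed: qps limit")] * 10
records.append(("demo-project", "logstore count exceeds the limit"))
for key, count in count_exceeded(records).items():
    print(key, count, severity(count))
```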

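Advanced monitoring

The items below are finer-grained breakdowns of the basic monitoring. They are usually not needed; refer to them if you want more granular alerting.

LogStore monitoring

Usage-level monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the LogStore count has reached the alert threshold.

Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

Query SQL:

* | select Project, region, round(count_logstore/quota_logstore * 100, 3) as logstore_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logstore), 0) AS count_logstore , cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logstore
FROM "resource.sls.cmdb.logstore" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_logstore is not null order by logstore_ratio desc limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario: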

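  • Severity is Critical when the LogStore count of any Project exceeds 90% of its quota
  • Severity is Medium when the LogStore count of any Project exceeds 80% of its quota

Note: when several trigger conditions are configured, they are evaluated from top to bottom, so logstore_ratio>90 must be placed above logstore_ratio>80.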
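
The effect of the ordering can be shown with a short Python sketch of top-to-bottom, first-match evaluation. The rule representation and function name are hypothetical and only model the behavior described in the note.

```python
from typing import List, Optional, Tuple

Rule = Tuple[float, str]  # (threshold in percent, severity label)


def first_matching_level(ratio: float, rules: List[Rule]) -> Optional[str]:
    """Walk the rules top to bottom and return the level of the first match."""
    for threshold, level in rules:
        if ratio > threshold:
            return level
    return None


correct_order = [(90.0, "critical"), (80.0, "medium")]
wrong_order = [(80.0, "medium"), (90.0, "critical")]

print(first_matching_level(95.0, correct_order))  # critical
print(first_matching_level(95.0, wrong_order))    # medium
```

With the wider condition listed first, a ratio of 95 already satisfies "greater than 80", so the stricter rule is never reached and the alert is raised at the lower severity.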

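Quota-exceeded monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the LogStore quota has been exceeded.

Query SQL:

* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%logstore count %' GROUP BY Project ORDER BY count DESC LIMIT 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the LogStore quota of any Project has been exceeded 10 times
  • Severity is Medium when the LogStore quota of any Project has been exceeded once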

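Machine group monitoring

Usage-level monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the machine group count has reached the alert threshold.

Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

Query SQL:

* | select Project, region, round(count_machine_group/quota_machine_group * 100, 3) as machine_group_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_machine_group), 0) AS count_machine_group , cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_machine_group
FROM "resource.sls.cmdb.machine_group" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_machine_group is not null order by machine_group_ratio desc limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the machine group count of any Project exceeds 90% of its quota
  • Severity is Medium when the machine group count of any Project exceeds 80% of its quota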

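Quota-exceeded monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the machine group quota has been exceeded.

Query SQL:

* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%machine group count%' GROUP BY Project ORDER BY count DESC LIMIT 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the machine group quota of any Project has been exceeded 10 times
  • Severity is Medium when the machine group quota of any Project has been exceeded once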

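Logtail collection configuration monitoring

Usage-level monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the usage level of the Logtail collection configuration count has reached the alert threshold.

Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL; for example, limit 1000 returns up to 1,000 rows.

Query SQL:

* | select Project, region, round(count_logtail_config/quota_logtail_config * 100, 3) as logtail_config_ratio from
(SELECT A.id as Project , A.region as region, COALESCE(SUM(B.count_logtail_config), 0) AS count_logtail_config , cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
FROM "resource.sls.cmdb.project" as A
LEFT JOIN (
SELECT project, COUNT(*) AS count_logtail_config
FROM "resource.sls.cmdb.logtail_config" as B
GROUP BY project
) AS B ON A.id = B.project
group by A.id, A.quota, A.region) where quota_logtail_config is not null order by logtail_config_ratio desc limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the Logtail collection configuration count of any Project exceeds 90% of its quota
  • Severity is Medium when the Logtail collection configuration count of any Project exceeds 80% of its quota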

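Quota-exceeded monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Logtail collection configuration quota has been exceeded.

Query SQL:

* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%config count%' GROUP BY Project ORDER BY count DESC LIMIT 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the Logtail collection configuration quota of any Project has been exceeded 10 times
  • Severity is Medium when the Logtail collection configuration quota of any Project has been exceeded once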

Project write traffic monitoring

Usage-level monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of Project write traffic over the last 5 minutes (relative) has reached the alert threshold.

Query SQL:

(*)| SELECT Project, region , round(count_inflow/cast(quota_inflow as double) * 100, 3) as inflow_ratio
FROM
(SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_inflow, round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double)/1000000000, 3) as quota_inflow from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, round(MAX(name1)/1000000000, 3) as name1 from (SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project, sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1
FROM "internal-monitor-metric.prom" where __name__ ='logstore_origin_inflow_bytes' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project,time )group by project) AS M ON cmdb.id = M.project )order by inflow_ratio desc limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write traffic of any Project exceeds 90% of its quota
  • Severity is Medium when the write traffic of any Project exceeds 80% of its quota

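Quota-exceeded monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Project write traffic quota has been exceeded.

Query SQL:

* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: inflow%' GROUP BY Project ORDER BY count DESC LIMIT 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write traffic quota of any Project has been exceeded 10 times
  • Severity is Medium when the write traffic quota of any Project has been exceeded once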

Project write count monitoring

Usage-level monitoring

1. Confirm the alert SQL: every minute, check whether the usage level of the Project write count over the last 5 minutes (relative) has reached the alert threshold.

Query SQL:

(*)| SELECT Project, region, round(count_write_cnt/cast(quota_write_cnt as double) * 100, 3) as write_cnt_ratio
FROM
(SELECT cmdb.id as Project, cmdb.region as region, COALESCE(M.name1,0) as count_write_cnt,
cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as bigint) as quota_write_cnt from "resource.sls.cmdb.project" as cmdb
LEFT JOIN (
select project, MAX(name1) as name1 from (SELECT __time_nano__ as time, element_at( split_to_map(__labels__, '|', '#$#') , 'project') as project,
sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name1
FROM "internal-monitor-metric.prom" where __name__ = 'logstore_write_count' and regexp_like(element_at( split_to_map(__labels__, '|', '#$#') , 'project') , '.*') group by project,time )group by project) AS M ON cmdb.id = M.project ) order by write_cnt_ratio desc limit 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write count of any Project exceeds 90% of its quota
  • Severity is Medium when the write count of any Project exceeds 80% of its quota

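Quota-exceeded monitoring

1. Confirm the alert SQL: every 15 minutes, check whether the Project write count quota has been exceeded.

Query SQL:

* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode:WriteQuotaExceed)| SELECT Project,
COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: qps%' GROUP BY Project ORDER BY count DESC LIMIT 1000

2. Alert configuration: configure the trigger conditions and the alert policy according to your business scenario:

  • Severity is Critical when the write count quota of any Project has been exceeded 10 times
  • Severity is Medium when the write count quota of any Project has been exceeded once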

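Requesting a resource quota adjustment

Procedure

  • Log on to the Simple Log Service console.
  • In the Project list, click the target Project.
  • Click the homepage icon.
  • Click Manage next to Resource Quotas.
  • In the Resource Quotas panel, adjust the quota of the target resource, and then click Save.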
