Tech post: understand Filebeat, the log collection tool, in one article

Published by 优采云: 2022-10-22 06:34


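This article uses Filebeat 7.7.0 and covers the following topics:

What is Filebeat

Filebeat and Beats

First of all, Filebeat is a member of the Beats family.

Beats is a family of lightweight data shippers with six members. In the early ELK architecture, Logstash was used to both collect and parse logs, but Logstash consumes a lot of memory, CPU, and I/O. Compared with Logstash, the CPU and memory footprint of Beats is almost negligible.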

Beats currently comprises six tools: Packetbeat, Metricbeat, Filebeat, Winlogbeat, Auditbeat, and Heartbeat.

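What is Filebeat

Filebeat is a lightweight shipper for forwarding and centralizing log data. It monitors the log files or locations you specify, collects log events, and forwards them to Elasticsearch or Logstash for indexing.

Filebeat works as follows: when you start Filebeat, it starts one or more inputs that look in the locations you have specified for log data. For each log that Filebeat locates, it starts a harvester. Each harvester reads a single log for new content and sends the new log data to libbeat, which aggregates the events and sends the aggregated data to the output configured for Filebeat.

The workflow is shown in the following diagram: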

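Filebeat and Logstash

Because Logstash runs on the JVM and consumes a lot of resources, its author later wrote logstash-forwarder in Go: a lightweight tool with fewer features but a much smaller footprint. The author, however, was working alone. After he joined Elastic, the company also acquired another open-source project, Packetbeat, which was written entirely in Go and had a full team behind it. Elastic therefore folded logstash-forwarder development into that Go team, and the resulting project was named Filebeat.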

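How Filebeat works

Components of Filebeat

Filebeat consists of two main components, inputs and harvesters, which work together to tail files and send event data to the output you specify.

A harvester is responsible for reading the content of a single file. It reads the file line by line and sends the content to the output. One harvester is started for each file. The harvester opens and closes the file, which means the file descriptor stays open while the harvester is running. If the file is removed or renamed while it is being harvested, Filebeat continues to read it. A side effect is that the disk space stays reserved until the harvester closes. By default, Filebeat keeps the file open until close_inactive is reached.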

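Closing a harvester has the following consequences: the file handler is closed; if the file is updated after the harvester has closed, it is picked up again once scan_frequency has elapsed; and if the file is moved or deleted while the harvester is closed, Filebeat cannot pick it up again, so any data the harvester had not yet read is lost.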

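An input is responsible for managing the harvesters and finding all sources to read from. If the input type is log, the input finds all files on the drive that match the defined paths and starts a harvester for each file. Each input runs in its own Go routine (a goroutine, not a separate process). Filebeat currently supports several input types, and each type can be defined multiple times. The log input checks each file to see whether a harvester needs to be started, whether one is already running, or whether the file can be ignored.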
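The interplay between an input and its harvesters can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Filebeat's Go implementation: the names `scan` and `harvest` are made up for this example, offsets live in a plain dict, and everything runs sequentially instead of in goroutines.

```python
import glob


def harvest(path, offset=0):
    """Read complete lines from `path`, starting at byte `offset`.

    Returns (lines, new_offset). A trailing partial line (no newline yet)
    is left unread so that it can be picked up on a later pass.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            raw = f.readline()
            if not raw.endswith(b"\n"):
                break  # end of file or incomplete line: do not advance
            lines.append(raw.rstrip(b"\r\n").decode("utf-8"))
            offset += len(raw)
    return lines, offset


def scan(patterns, offsets):
    """One pass of a toy input: glob the patterns, harvest each match."""
    events = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            lines, offsets[path] = harvest(path, offsets.get(path, 0))
            events.extend((path, line) for line in lines)
    return events
```

Calling `scan` repeatedly with the same `offsets` dict returns only the lines appended since the previous call, which is roughly what a periodic scan achieves.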

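How Filebeat keeps the state of files

Filebeat keeps the state of each file and frequently flushes it to disk in the registry file. The state is used to remember the last offset a harvester read and to make sure all log lines are sent. If the output (such as Elasticsearch or Logstash) is unreachable, Filebeat keeps track of the last line sent and continues reading the file as soon as the output becomes available again. While Filebeat is running, the state information is also kept in memory for each input. When Filebeat restarts, data from the registry file is used to rebuild the state, and Filebeat continues each harvester at the last known position. For each input, Filebeat keeps a state for every file it finds. Because files can be renamed or moved, the file name and path are not enough to identify a file. For each file, Filebeat stores unique identifiers to detect whether the file was harvested previously.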
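A minimal Python sketch of such a registry follows. It assumes a POSIX-like filesystem where the device and inode numbers reported by `os.stat` stay the same across a rename. The JSON layout and the `Registry` class are hypothetical and do not mirror Filebeat's real registry format.

```python
import json
import os


def file_key(path):
    """Identify a file by device and inode rather than by its name."""
    st = os.stat(path)
    return f"{st.st_dev}-{st.st_ino}"


class Registry:
    """Toy registry that maps a file identity to the last read offset."""

    def __init__(self, registry_path):
        self.registry_path = registry_path
        self.state = {}
        if os.path.exists(registry_path):
            # Rebuild the in-memory state from disk after a restart.
            with open(registry_path, "r", encoding="utf-8") as f:
                self.state = json.load(f)

    def offset_for(self, path):
        return self.state.get(file_key(path), {}).get("offset", 0)

    def update(self, path, offset):
        self.state[file_key(path)] = {"source": path, "offset": offset}
        self._flush()

    def _flush(self):
        # Write to a temporary file and rename it, so that a crash never
        # leaves a half-written registry behind.
        tmp = self.registry_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.registry_path)
```

Because the key is derived from the inode, a log file that is rotated from `app.log` to `app.log.1` is still recognized, and reading resumes at the stored offset instead of starting over.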

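How Filebeat ensures at-least-once delivery

Filebeat guarantees that events are delivered to the configured output at least once and with no data loss, because it stores the delivery state of each event in the registry file. When the defined output is blocked and has not confirmed all events, Filebeat keeps trying to send them until the output acknowledges receipt. If Filebeat shuts down while it is sending events, it does not wait for the output to acknowledge all events before shutting down. When Filebeat restarts, any events that were not acknowledged before the shutdown are sent again. This ensures that each event is sent at least once, but duplicates may end up in the output.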
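The consequence, possible duplicates but no loss, can be reproduced with a toy model. `FlakyOutput`, `publish`, and the `acked` set are hypothetical names chosen for this sketch, and the retry loop is bounded only so that the example always terminates, whereas Filebeat keeps retrying.

```python
class FlakyOutput:
    """Hypothetical output whose first `failures` acknowledgements get lost."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def send(self, event):
        self.received.append(event)  # the event may well have arrived ...
        if self.failures > 0:
            self.failures -= 1
            return False             # ... even though the ACK never came back
        return True


def publish(events, acked, output, max_attempts=10):
    """Send every event not yet recorded in `acked`, retrying until acknowledged."""
    for index, event in enumerate(events):
        if index in acked:
            continue                 # already confirmed, e.g. before a restart
        for _ in range(max_attempts):
            if output.send(event):
                acked.add(index)     # record delivery only after the ACK
                break
```

With one lost acknowledgement, the output receives the first event twice: the sender cannot tell a lost ACK from a lost event, so it resends, which is exactly the trade-off of at-least-once delivery.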

How to use Filebeat

Installing from the archive

This article installs Filebeat from the Linux archive, filebeat-7.7.0-linux-x86_64.tar.gz.

    curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.7.0-linux-x86_64.tar.gz
    tar -xzvf filebeat-7.7.0-linux-x86_64.tar.gz

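Sample configuration file: filebeat.reference.yml (contains all non-deprecated options)

Configuration file: filebeat.yml

Basic commands

See the official documentation for details.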

    export    # export
    run       # run Filebeat (the default command)
    test      # test the configuration
    keystore  # manage the secrets keystore
    modules   # manage module configurations
    setup     # set up the initial environment

For example: ./filebeat test config  # checks whether the configuration file is valid

Inputs and outputs

Supported inputs:

Azure Event Hub, Cloud Foundry, Container, Docker, Google Pub/Sub, HTTP JSON, Kafka, Log, MQTT, NetFlow, Office 365 Management Activity API, Redis, S3, Stdin, Syslog, TCP, UDP (Log is the most commonly used)

Supported outputs:

Elasticsearch, Logstash, Kafka, Redis, File, Console, Elastic Cloud (Elasticsearch and Logstash are the most commonly used)

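Using the keystore

The keystore keeps sensitive information such as passwords out of the configuration file. For example, you can store the Elasticsearch password under the key ES_PWD and then reference it as ${ES_PWD} wherever the password is needed.

Once the key exists, its value can be referenced with ${ES_PWD}, for example: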

    output.elasticsearch.password: "${ES_PWD}"
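How a `${NAME}` reference turns into a value can be pictured with the following Python sketch. It is an illustration only: the `secrets` dict stands in for the keystore, `resolve` is a made-up helper, and nothing here reflects how Filebeat actually stores or encrypts keys.

```python
import re

# Matches references such as ${ES_PWD}.
_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(value, secrets):
    """Replace each ${NAME} in `value` with secrets[NAME].

    Raises KeyError when a referenced name is missing, rather than
    silently leaving the placeholder in the configuration.
    """
    return _REF.sub(lambda m: secrets[m.group(1)], value)
```

Failing loudly on an unknown key is a deliberate choice in this sketch: a placeholder that silently survives into the final configuration would be sent to Elasticsearch as a literal password.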

filebeat.yml configuration (using the log input type as an example)

See the official documentation for details.

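    # The input type is log.
    type: log
    # Set to true so that this log input takes effect.
    enabled: true
    # Log files to monitor. Patterns are expanded with Go's glob function and
    # directories are not searched recursively. The pattern below only looks
    # for files ending in ".log" in the subdirectories of /var/log, not in
    # /var/log itself.
    paths:
      - /var/log/*/*.log
    # Enables recursive glob patterns: /foo/** covers /foo, /foo/*, /foo/*/*.
    recursive_glob.enabled:
    # Encoding of the monitored files. Both plain and utf-8 can handle
    # Chinese logs.
    encoding:
    # Drop lines that match any of these regular expressions.
    exclude_lines: ['^DBG']
    # Keep only lines that match any of these regular expressions.
    include_lines: ['^ERR', '^WARN']
    # Buffer size in bytes that each harvester uses when fetching a file.
    harvester_buffer_size: 16384
    # Maximum number of bytes in a single log message. All bytes after
    # max_bytes are discarded and not sent. Default: 10 MB (10485760).
    max_bytes: 10485760
    # Regular expressions matching the files that Filebeat should ignore.
    exclude_files: ['\.gz$']
    # Default 0 (disabled). Accepts values such as 2h or 2m. Files that have
    # not been updated within this time span, or that have never been
    # harvested, are ignored. ignore_older must be greater than close_inactive.
    ignore_older: 0

    # close_* options close the harvester after a given criterion or time.
    # Closing the harvester means closing the file handler. If the file is
    # updated after the harvester has closed, it is picked up again once
    # scan_frequency has elapsed. If the file is moved or deleted while the
    # harvester is closed, Filebeat cannot pick it up again and any data the
    # harvester had not read is lost.

    # Closes the file handle when the file has not been read for the given
    # duration. The countdown starts when the harvester reads the last line,
    # based on an internal timestamp and not on the file's modification time.
    # If a closed file changes, a new harvester is started after
    # scan_frequency has elapsed. Use a value larger than the interval at
    # which the log is written, and configure several inputs for log files
    # that update at different rates. Written as a duration such as 2h or 5m.
    close_inactive:
    # Closes the harvester when the file is renamed or moved.
    close_renamed:
    # Closes the harvester when the file is deleted. If you disable this
    # option, you must also disable clean_removed.
    close_removed:
    # Suited to files that are written only once: the harvester closes as
    # soon as the end of the file is reached.
    close_eof:
    # Gives every harvester a fixed lifetime. The harvester is closed when the
    # time is up, wherever it is in the file. Do not set close_timeout equal
    # to ignore_older, or updates to the file will not be read. The timeout
    # does not apply while the output is stalled: at least one event must be
    # sent before the harvester can be closed. 0 disables the option.
    close_timeout:

    # Removes the state of previously harvested files from the registry file.
    # Must be greater than ignore_older + scan_frequency, so that no state is
    # removed while a file is still being harvested. Helps to keep the
    # registry small, especially when many new files are created every day,
    # and can also prevent problems caused by inode reuse on Linux.
    clean_inactive:
    # Removes a file's state from the registry when the file can no longer be
    # found on disk. Must be disabled if close_removed is disabled.
    clean_removed:
    # How often the input checks the configured paths for new files.
    # Default: 10s.
    scan_frequency:
    # If true, Filebeat starts reading at the end of each file and sends every
    # newly added line as an event, instead of resending the whole file.
    tail_files:
    # Allows Filebeat to harvest symbolic links in addition to regular files.
    # Filebeat opens and reads the original file even though it reports the
    # path of the symlink.
    symlinks:
    # How aggressively Filebeat checks open files for updates: the time to
    # wait before checking a file again after EOF is reached. Default: 1s.
    backoff:
    # Maximum time Filebeat waits before checking a file again after EOF.
    max_backoff:
    # Factor by which the backoff wait time is multiplied each time.
    # Default: 2.
    backoff_factor:
    # Limits the number of harvesters one input starts in parallel, which
    # directly affects the number of open files.
    harvester_limit:
    # Tags added to each event, useful for filtering. Example: tags: ["json"]
    tags:
    # Optional extra fields added to the output. Values can be scalars,
    # arrays, dictionaries, or any nested combination. By default they are
    # grouped under a "fields" sub-dictionary. Example:
    #   filebeat.inputs:
    #     - type: log
    #       fields:
    #         app_id: query_engine_12
    fields:
    # If true, the custom fields are stored at the top level of the output
    # document.
    fields_under_root:

    # The regular expression that lines are matched against.
    multiline.pattern:
    # Whether the pattern is negated. Default: false. With the pattern '^b'
    # and negate false, consecutive lines that start with b are merged into
    # the neighbouring line that does not. With negate true, consecutive
    # lines that do not start with b are merged into the neighbouring line
    # that does.
    multiline.negate:
    # How Filebeat combines the lines into an event: after or before. The
    # exact behavior depends on negate.
    multiline.match:
    # Maximum number of lines that can be combined into one event. Additional
    # lines are discarded. Default: 500.
    multiline.max_lines:
    # After this timeout, Filebeat sends the multiline event even if no new
    # pattern has been found to start a new event. Default: 5s.
    multiline.timeout:

    # Maximum number of CPUs that can execute simultaneously. Defaults to the
    # number of logical CPUs available in the system.
    max_procs:
    # Name of this Filebeat instance. Defaults to the hostname.
    name: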
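The multiline options are the easiest ones to get wrong, so here is a small Python model of how `multiline.pattern` and `multiline.negate` interact. It covers only `multiline.match: after`, ignores `multiline.timeout`, and `group_multiline` is a made-up helper rather than anything Filebeat exposes.

```python
import re


def group_multiline(lines, pattern, negate=False, max_lines=500):
    """Group lines into events, modeling `multiline.match: after` only.

    A line is a continuation when (it matches `pattern`) != negate.
    Continuation lines are appended to the event started by the most
    recent non-continuation line.
    """
    regex = re.compile(pattern)
    events = []
    current = []
    for line in lines:
        is_continuation = (regex.search(line) is not None) != negate
        if is_continuation and current:
            if len(current) < max_lines:
                current.append(line)  # lines beyond max_lines are dropped
        else:
            if current:
                events.append("\n".join(current))
            current = [line]
    if current:
        events.append("\n".join(current))
    return events
```

For a log whose entries start with `[`, calling `group_multiline(lines, r"^\[", negate=True)` keeps each stack trace attached to the entry that produced it: lines that do not start with `[` are appended to the preceding line that does.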

Example 1: Logstash as the output

filebeat.yml configuration:

    #=========================== Filebeat inputs =============================

    filebeat.inputs:

    # Each - is an input. Most options can be set at the input level, so
    # you can use different inputs for various configurations.
    # Below are the input specific configurations.

    - type: log

      # Change to true to enable this input configuration.
      enabled: true

      # Paths that should be crawled and fetched. Glob based paths.
      # Several log paths are configured here.
      paths:
        - /var/logs/es_aaa_index_search_slowlog.log
        - /var/logs/es_bbb_index_search_slowlog.log
        - /var/logs/es_ccc_index_search_slowlog.log
        - /var/logs/es_ddd_index_search_slowlog.log
        #- c:\programdata\elasticsearch\logs\*

      # Exclude lines. A list of regular expressions to match. It drops the lines that are
      # matching any regular expression from the list.
      #exclude_lines: ['^DBG']

      # Include lines. A list of regular expressions to match. It exports the lines that are
      # matching any regular expression from the list.
      #include_lines: ['^ERR', '^WARN']

      # Exclude files. A list of regular expressions to match. Filebeat drops the files that
      # are matching any regular expression from the list. By default, no files are dropped.
      #exclude_files: ['.gz$']

      # Optional additional fields. These fields can be freely picked
      # to add additional information to the crawled log files for filtering
      #fields:
      #  level: debug
      #  review: 1

      ### Multiline options

      # Multiline can be used for log messages spanning multiple lines. This is common
      # for Java Stack Traces or C-Line Continuation

      # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
      #multiline.pattern: ^\[

      # Defines if the pattern set under pattern should be negated or not. Default is false.
      #multiline.negate: false

      # Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
      # that was (not) matched before or after or as long as a pattern is not matched based on negate.
      # Note: After is the equivalent to previous and before is the equivalent to next in Logstash
      #multiline.match: after

    #================================ Outputs =====================================

    #----------------------------- Logstash output --------------------------------
    output.logstash:
      # The Logstash hosts. Several hosts are listed so that load balancing can be used.
      hosts: ["192.168.110.130:5044", "192.168.110.131:5044", "192.168.110.132:5044", "192.168.110.133:5044"]
      loadbalance: true  # enable load balancing

      # Optional SSL. By default is off.
      # List of root certificates for HTTPS server verifications
      #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

      # Certificate for SSL client authentication
      #ssl.certificate: "/etc/pki/client/cert.pem"

      # Client Certificate Key
      #ssl.key: "/etc/pki/client/cert.key"

    ./filebeat -e  # start Filebeat
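What `loadbalance: true` changes can be pictured with a toy dispatcher in Python. This is a simplification under stated assumptions: `assign_batches` is a hypothetical helper, round-robin is used only because it is easy to follow, and the sketch does not claim to reproduce how Filebeat actually schedules batches or handles failover.

```python
import itertools
import random


def assign_batches(batches, hosts, loadbalance, rng=None):
    """Return a list of (host, batch) pairs.

    loadbalance=True:  spread the batches over all hosts (round-robin here).
    loadbalance=False: send everything to one randomly chosen host.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if loadbalance:
        cycle = itertools.cycle(hosts)
        return [(next(cycle), batch) for batch in batches]
    rng = rng or random.Random()
    host = rng.choice(hosts)
    return [(host, batch) for batch in batches]
```

With the four Logstash hosts from the configuration above and eight batches, the load-balanced variant gives each host two batches, while the other variant sends all eight to a single host.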

Logstash configuration:

    input {
      beats {
        port => 5044
      }
    }

    output {
      elasticsearch {
        hosts => ["http://192.168.110.130:9200"]  # several hosts can be listed here
        index => "query-%{+yyyyMMdd}"
      }
    }

Example 2: Elasticsearch as the output

filebeat.yml configuration:

    ###################### Filebeat Configuration Example #########################

    # This file is an example configuration file highlighting only the most common
    # options. The filebeat.reference.yml file from the same directory contains all the
    # supported options with more comments. You can use it as a reference.
    #
    # You can find the full configuration reference here:
    # https://www.elastic.co/guide/en/beats/filebeat/index.html

    # For more available modules and options, please see the filebeat.reference.yml sample
    # configuration file.

    #=========================== Filebeat inputs =============================

    filebeat.inputs:

    # Each - is an input. Most options can be set at the input level, so
    # you can use different inputs for various configurations.
    # Below are the input specific configurations.

    - type: log

      # Change to true to enable this input configuration.
      enabled: true

      # Paths that should be crawled and fetched. Glob based paths.
      paths:
        - /var/logs/es_aaa_index_search_slowlog.log
        - /var/logs/es_bbb_index_search_slowlog.log
        - /var/logs/es_ccc_index_search_slowlog.log
        - /var/logs/es_dddd_index_search_slowlog.log
        #- c:\programdata\elasticsearch\logs\*

      # Exclude lines. A list of regular expressions to match. It drops the lines that are
      # matching any regular expression from the list.
      #exclude_lines: ['^DBG']

      # Include lines. A list of regular expressions to match. It exports the lines that are
      # matching any regular expression from the list.
      #include_lines: ['^ERR', '^WARN']

      # Exclude files. A list of regular expressions to match. Filebeat drops the files that
      # are matching any regular expression from the list. By default, no files are dropped.
      #exclude_files: ['.gz$']

      # Optional additional fields. These fields can be freely picked
      # to add additional information to the crawled log files for filtering
      #fields:
      #  level: debug
      #  review: 1

      ### Multiline options

      # Multiline can be used for log messages spanning multiple lines. This is common
      # for Java Stack Traces or C-Line Continuation

      # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
      #multiline.pattern: ^\[

      # Defines if the pattern set under pattern should be negated or not. Default is false.
      #multiline.negate: false

      # Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
      # that was (not) matched before or after or as long as a pattern is not matched based on negate.
      # Note: After is the equivalent to previous and before is the equivalent to next in Logstash
      #multiline.match: after

    #============================= Filebeat modules ===============================

    filebeat.config.modules:
      # Glob pattern for configuration loading
      path: ${path.config}/modules.d/*.yml

      # Set to true to enable config reloading
      reload.enabled: false

      # Period on which files under path should be checked for changes
      #reload.period: 10s

    #==================== Elasticsearch template setting ==========================

    #================================ General =====================================

    # The name of the shipper that publishes the network data. It can be used to group
    # all the transactions sent by a single shipper in the web interface.
    name: filebeat222

    # The tags of the shipper are included in their own field with each
    # transaction published.
    #tags: ["service-X", "web-tier"]

    # Optional fields that you can specify to add additional information to the
    # output.
    #fields:
    #  env: staging

    #cloud.auth:

    #================================ Outputs =====================================

    #-------------------------- Elasticsearch output ------------------------------
    output.elasticsearch:
      # Array of hosts to connect to.
      hosts: ["192.168.110.130:9200", "192.168.110.131:9200"]

      # Protocol - either `http` (default) or `https`.
      #protocol: "https"

      # Authentication credentials - either API key or username/password.
      #api_key: "id:api_key"
      username: "elastic"
      password: "${ES_PWD}"  # the password is supplied through the keystore

    ./filebeat -e  # start Filebeat

Check the Elasticsearch cluster: an index has been created with the default name pattern filebeat-%{[agent.version]}-%{+yyyy.MM.dd}
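To make the pattern syntax concrete, the following Python sketch expands `%{[field.path]}` and `%{+dateformat}` references into an index name. It is an illustration with assumptions: `expand_index` is a made-up helper, only the date tokens `yyyy`, `MM`, and `dd` are handled, and the `agent.version` field used in the usage note is assumed to hold the Filebeat version.

```python
import re
from datetime import datetime

# Only the date tokens needed for a yyyy.MM.dd style pattern are supported.
_DATE_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def expand_index(pattern, event, when):
    """Expand %{[field.path]} and %{+dateformat} references in `pattern`."""

    def lookup(match):
        value = event
        for part in match.group(1).split("."):
            value = value[part]  # walk nested dictionaries
        return str(value)

    def date(match):
        fmt = re.sub("yyyy|MM|dd", lambda m: _DATE_TOKENS[m.group(0)], match.group(1))
        return when.strftime(fmt)

    out = re.sub(r"%\{\[([^\]]+)\]\}", lookup, pattern)
    return re.sub(r"%\{\+([^}]+)\}", date, out)
```

For an event `{"agent": {"version": "7.7.0"}}` dated 22 October 2022, `expand_index("filebeat-%{[agent.version]}-%{+yyyy.MM.dd}", event, datetime(2022, 10, 22))` yields `filebeat-7.7.0-2022.10.22`, so a new index name appears each day.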

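Filebeat modules

See the official documentation.

Here I use the Elasticsearch module to parse Elasticsearch slow query logs. The steps are as follows; other modules work the same way.

Prerequisites: Elasticsearch and Kibana are installed, along with Filebeat.

The detailed steps are described in the official documentation.

Step 1: configure filebeat.yml: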

    #============================== Kibana =====================================

    # Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
    # This requires a Kibana endpoint configuration.
    setup.kibana:

      # Kibana Host
      # Scheme and port can be left out and will be set to the default (http and 5601)
      # In case you specify and additional path, the scheme is required: http://localhost:5601/path
      # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
      host: "192.168.110.130:5601"  # the Kibana host
      username: "elastic"           # user
      password: "${ES_PWD}"         # password, read from the keystore to avoid plain text

      # Kibana Space ID
      # ID of the Kibana Space into which the dashboards should be loaded. By default,
      # the Default Space will be used.
      #space.id:

    #================================ Outputs =====================================

    # Configure what output to use when sending the data collected by the beat.

    #-------------------------- Elasticsearch output ------------------------------
    output.elasticsearch:
      # Array of hosts to connect to.
      hosts: ["192.168.110.130:9200", "192.168.110.131:9200"]

      # Protocol - either `http` (default) or `https`.
      #protocol: "https"

      # Authentication credentials - either API key or username/password.
      #api_key: "id:api_key"
      username: "elastic"    # Elasticsearch user
      password: "${ES_PWD}"  # Elasticsearch password
      # Do not set index here: no template is configured, so an index named
      # filebeat-%{[agent.version]}-%{+yyyy.MM.dd} is created automatically.

Step 2: configure the path of the Elasticsearch slow logs:

    cd filebeat-7.7.0-linux-x86_64/modules.d

Then edit the module configuration with vim elasticsearch.yml.

Step 3: enable the Elasticsearch module:

    ./filebeat modules enable elasticsearch

List the modules to see which ones are enabled:

    ./filebeat modules list

Step 4: set up the initial environment:

    ./filebeat setup -e

Step 5: start Filebeat:

    ./filebeat -e

Check the Elasticsearch cluster again. As the figure below shows, the slow query log entries are parsed automatically:

At this point, the Elasticsearch module has been tested successfully.
