CYBERSECURITY JOB HUNTING GUIDE
Graylog's pipeline rules
Author: Stefan Waldvogel
How to use regex, Grok, lookup tables with a pipeline rule
-under construction-
Overview
Graylog offers several ways to extract data out of a specific field: you can use extractors or pipelines. Extractors are a quick way to get started, but they cost CPU power. Pipelines are more efficient and the preferred way.
Requirements to test the code
- adjust the Message Processors Configuration so that the "Pipeline Processor" runs last in the order, after the Message Filter Chain
- you need a log file, or some other kind of data
- you need to configure an input
- you need to send data to this input
- (in a real environment you might want to configure a stream to separate the data)
- add a new pipeline and add a new connection (select your stream; "All Messages" is the default stream)
-> Manage rules
-> create a rule.
Once the rule has a name, attach it to a pipeline.
-> after attaching it, you can simulate the code
Helpful websites for regex
All commands are here:
docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
A regex builder:
regex101.com/
Graylog uses its own dialect; I will explain the differences later, but you can prebuild your code here.
The GUI
The window for this feature looks like this:
- You find this window under System/Pipelines
- Add a description. Graylog does not allow comments in the rule field; use the description instead.
- In this field you write the code; when you start, it is empty.
- All special functions are listed here. These prebuilt functions are very useful and let you extract/manipulate specific data in a faster way.
- Here is an example
General rule structure
Graylog uses a very easy structure:
rule "rule name"
when
    true
then
    // do something, e.g.
    set_field("new_field", "value");
end
One example in detail
Goal for this example:
An incoming log arrives in the message field, and the logs could look like this:
<14>1 1234 Random unwanted stuff
<14>1 5432 Different random unwanted stuff
<14>1 8745 More different random unwanted stuff
-> For whatever reason, we need the four digits after the static <14>1.
The code looks like this:
- The first line gives the rule name. It allows spaces. If you use the same name as another rule and save, you see a red error at the bottom. The error does not give you a hint about what is wrong.
- This is the condition, and it is boolean. If it is true, the rule will be applied.
- Here, the then field has three blocks. The first block starts with "let grep = regex....." This is a function; I will explain it in detail later.
- With set_field I can create a new field, but there are a lot more options; Graylog offers dozens of different prebuilt functions. For our task, we only need the first set_field; the second one is a helper field used while developing the code. I will explain it later.
- At the end, there needs to be an end.
The regex pattern in detail:
- Each expression in Graylog starts (and ends) with a ". If something is wrong, you might see a different color, or the color changes inside your string. If you need a literal " in your string, escape it as \".
- The ^ is the start anchor; the interpreter starts matching at the beginning of the message. Here it is optional, the code would run without it.
- <14>1 is a static value. The regex interpreter looks for this pattern, and if the message contains it, the interpreter moves on.
- \\s is a special character and means "look for a whitespace character". If there is a space between <14>1 and the next part, the interpreter moves on. If you write the code on regex101, you can use a real whitespace, but for Graylog you need \\s.
- This expression is complex and has multiple parts. The most important part are the parentheses (). Everything matched inside parentheses goes into a variable/array; here, the name of the variable is grep. Later, you can access the content with grep["0"]. It is 0 because all arrays start at 0 and this is the first group in the expression.
The next part is [\\d]. The square brackets hold this structure together, and \\d is a special character that matches digits.
The {4} is a quantifier for [\\d]: it matches a digit exactly 4 times. -> If the message has at least four digits at this position, the value is copied into the grep variable. The variable then looks like this: {0:dddd}, where each d represents a found digit. - The \\s is again a whitespace. If the message has at least 4 digits followed by a whitespace, the interpreter moves on.
Hint: the message could look like this: <14>1 12345 and it is still a valid extraction (it extracts 1234), because there are at least 4 digits. If the message is <14>1 123, it is not valid and the interpreter stops. If we want to check for exactly 4 digits, we have to adjust the code:
"^<14>1\\s([\\d]{4})\\s.*?(.*?)$"
This time, the pattern contains a .*? and this construction enforces an exact match: after 4 digits, a whitespace must appear, otherwise the interpreter stops. - The second pair of parentheses puts everything that matches into the variable grep; the .*? means everything up to the end.
- The $ is the end anchor; the interpreter runs until the end of the message.
- The last " closes the pattern.
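The pattern can also be tried outside Graylog. A small Python sketch (Python's re module is close enough to Graylog's Java dialect for this pattern; note the single backslashes, since we are not inside a Graylog rule string here):

```python
import re

# Python translation of the Graylog pattern "^<14>1\\s([\\d]{4})\\s.*?(.*?)$".
# In a rule the backslashes are doubled because the pattern lives inside a
# double-quoted string; as a plain regex it uses single backslashes.
pattern = r"^<14>1\s([\d]{4})\s.*?(.*?)$"

msg = "<14>1 1234 Random unwanted stuff"
match = re.search(pattern, msg)
print(match.group(1))  # the four digits we want

# With the \s required after {4}, a five-digit number no longer matches:
print(re.search(pattern, "<14>1 12345 Random unwanted stuff"))
```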
The improved code looks like this:
How to write a complex pattern?
Graylog's editor offers some small help, but it is very hard to find an error in a large regex expression. Writing working code for 50 fields is very painful, but luckily we can use external editors to prebuild the code. One option is:
regex101.com/
As mentioned before, Graylog's syntax is non-standard, but follows easy rules. We can write the code with regex101 and then change a few small things. If we paste our Graylog code into regex101 unchanged, it does not work and looks like this:
We have "no match" but if we change some small things, it will work:
For this code, I only changed 2 things:
- removed the \\s and set a whitespace
- removed one \ before the d
Other patterns might contain a " in the expression. regex101 does not need an escape, but Graylog needs a \".
According to Graylog's forum, Graylog uses a simplified regex command set, -> not all possible structures and commands will work.
A different more complex example:
Graylog pattern: .*?:\\s*(.*?)\\sfrom\\s*([\\d\\.]+).*?to\\s*([\\d\\.]+).*?on\\s*interface\\s*(.*?)$
matching code: %ASA-4-400013 IPS:2003 ICMP redirect from 10.4.1.2 to 10.2.1.1 on interface dmz
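This pattern, too, can be checked with plain Python regex (single backslashes instead of Graylog's doubled ones; re.search behaves like the Java engine for this expression):

```python
import re

# The ASA pattern from above, written as a plain regex.
pattern = r".*?:\s*(.*?)\sfrom\s*([\d\.]+).*?to\s*([\d\.]+).*?on\s*interface\s*(.*?)$"
line = "%ASA-4-400013 IPS:2003 ICMP redirect from 10.4.1.2 to 10.2.1.1 on interface dmz"

m = re.search(pattern, line)
print(m.groups())  # event text, source IP, destination IP, interface
```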
To get it to work on regex101, I substituted:
- all \\s with a whitespace
- deleted one \ before a \\d
Example with a "
I am using the same test string with two differences. I want to extract a field delimited by " characters, because many log files use this method.
matching code: %ASA-4-400013 IPS"2003 ICMP redirect" from 10.4.1.2 to 10.2.1.1 on interface dmz
Graylog pattern: .*?\"\\s*(.*?)\"\\sfrom\\s*([\\d\\.]+).*?to\\s*([\\d\\.]+).*?on\\s*interface\\s*(.*?)$
If you want to include both " characters, we have to change the position of both () to:
Graylog's simulator shows that we moved the () to the right position:
A very useful construction is:
\\s*([\\d\\.]+).*?to
To understand it, we can modify the test message to:
The output looks like this:
The command does the following thing:
- \\s* looks for zero or more whitespace characters. If it matches, the interpreter moves on.
- The round bracket defines the area to extract.
- [\\d\\.] matches a single digit or a dot.
- + repeats this one or more times; all matched digits and dots go into the variable grep, here grep["1"].
- .*? is a lazy match that stops at the keyword: to
This is an IP address, but we can add a {n} to the code and it looks like this:
The result works, but only because 10.4.1.2 has 8 characters. If the IP has fewer, the interpreter will not detect the IP; if the IP has more than 8 characters, it will take the first 8. An IP like 100.100.10.3 looks like this:
How can we improve this code for an IP address?
The {} takes more than one argument; we can use a range, too. IP addresses range between 1.1.1.1 and 255.255.255.255 -> between 7 and 15 characters. The important question is: what is the data source, and can we trust the data? Regex can validate an IP address, but the statement gets very long. We want to extract data out of a log file, and we do not want to sanitize data. The next code shows two options to pick an IP address:
- It is a very short and efficient version
- It is a much longer format but it checks for 4 groups separated by a dot.
- If you want to validate the IP, you could use something like this: (([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
-> We have the when field. There we could use this regex to check whether we have a real IP. If yes, we could do something else with this IP.
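The validating pattern can be tested in Python (single backslashes again; re.fullmatch stands in for a full-match check):

```python
import re

# The validating octet alternation from above: 0-9, 10-99, 100-199, 200-249, 250-255.
octet = r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
ip_pattern = r"(" + octet + r"\.){3}" + octet  # three "octet." groups plus a final octet

print(bool(re.fullmatch(ip_pattern, "10.4.1.2")))          # valid
print(bool(re.fullmatch(ip_pattern, "255.255.255.255")))   # valid
print(bool(re.fullmatch(ip_pattern, "256.1.1.1")))         # invalid octet
```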
Extract fields after a specific string
Many log formats start with a specific string and it could look like this:
headerinfo - - - - data1 unwanteddata
In this example, we want one field after - - - -. With regex101 we can write our first code.
Now, we can substitute all whitespace characters with \\s:
The result:
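The screenshots with the exact pattern are not reproduced here, but a pattern along these lines does the job; here it is in regex101 form (real spaces, which become \\s in Graylog), tested with Python:

```python
import re

line = "headerinfo - - - - data1 unwanteddata"

# Match the static "- - - -" marker, then capture the first
# non-whitespace field after it.
pattern = r"- - - - (\S+)"

m = re.search(pattern, line)
print(m.group(1))
```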
Extracting multiple similar fields
Sometimes logs contain fields without special characters. If we have such a log file, the pattern looks very simple. One example:
data1 data2 data3 "datawithbrackets" data5
This approach is really cool, but Graylog's regex language does not support this pattern. The reason might be the array: this construct would create a weird array with null fields in it. Technically it looks like:
{0:data1, 1:NULL, 2:data2....}
Possible reasons:
- Graylog's regex cannot insert NULL fields into an array
- this kind of code is not supported
- Graylog's regex needs one matching expression for each group.
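Python's re engine shows a related limitation: a repeated capture group keeps only its last repetition, which is why one explicit group per field is the safe approach in any engine:

```python
import re

line = 'data1 data2 data3 "datawithbrackets" data5'

# One repeated group: the engine overwrites the capture on every
# repetition, so only the final field survives.
repeated = re.match(r"(?:(\S+)\s?)+", line)
print(repeated.group(1))

# One explicit group per field works everywhere:
explicit = re.match(r'(\S+)\s(\S+)\s(\S+)\s"(.*?)"\s(\S+)', line)
print(explicit.groups())
```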
Extract a date with a header
Extracting a date is a very common task. Graylog has a built-in function for this, but first we will do it with regex. On regex101, it looks like this:
First, we have header data that ends with - - - -, followed by a date and other data. To get the string into Graylog, substitute all whitespace with \\s and add a \ before each \d.
Right now, writing code in regex101 does not look very smart, but imagine you want to extract 50 fields. If you use regex101, you get a lot more data and you can change code very quickly.
Graylog has functions to modify time and date. Sometimes you get a date like 2021-12-28 but a database requires a full timestamp. Graylog has the function parse_date and we can use it.
The code for the output:
The result looks like this:
With this function, you can reformat time and date. This is useful if we have multiple different log sources, but in our database we want to work with one standard. If we have something like this:
20180227
the function parse_date can transform it into Graylog's standard time. The code for this example looks like this:
With the pattern, we can transform any date/time into our wanted format. Symantec's ProxySG logs use:
yyyy-MM-dd HH:mm:ss
A syslog could look like this:
May 12 14:30:00
and the matching pattern is: MMM dd HH:mm:ss
Sadly, syslog does not have a year, and Graylog takes 2000 as the default. Technically we could take the timestamp field, extract the year, and add it to the syslog line with the function concat.
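Graylog's parse_date uses Joda-Time patterns like yyyy-MM-dd HH:mm:ss; Python's strptime uses % codes, but the idea is the same. A sketch of the three date formats above (note that Python defaults a missing year to 1900, where Graylog, per the text, uses 2000):

```python
from datetime import datetime

# Joda "yyyy-MM-dd HH:mm:ss" corresponds to "%Y-%m-%d %H:%M:%S".
proxy_ts = datetime.strptime("2021-12-28 14:30:00", "%Y-%m-%d %H:%M:%S")
print(proxy_ts.isoformat())

# Compact "yyyyMMdd" dates such as 20180227 parse the same way:
compact = datetime.strptime("20180227", "%Y%m%d")
print(compact.date())

# Syslog "MMM dd HH:mm:ss" has no year; strptime falls back to 1900,
# so the year must be added separately (e.g. from the timestamp field).
syslog_ts = datetime.strptime("May 12 14:30:00", "%b %d %H:%M:%S")
print(syslog_ts.year)
```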
Key value function
The function key_value is one of the most useful functions for logs and other files. A file or log could look like this:
car=Kia|color=perl_blue|type=sedan
or
IP=123.32.12.1|hostname="server 1"|location=NY
The code is very simple:
The result:
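The effect of key_value can be sketched in Python; this toy parser only illustrates the idea (split on the pair delimiter, then on =), it is not Graylog's implementation:

```python
def key_value(message, pair_sep="|", kv_sep="="):
    """Split 'k=v|k=v' style messages into a dict, stripping optional quotes."""
    fields = {}
    for pair in message.split(pair_sep):
        key, _, value = pair.partition(kv_sep)
        fields[key] = value.strip('"')
    return fields

print(key_value("car=Kia|color=perl_blue|type=sedan"))
print(key_value('IP=123.32.12.1|hostname="server 1"|location=NY'))
```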
Rename fields
The last task gave us some fields, but maybe IP is too generic. Is it a source IP or a destination IP? We can change the field name with the rename_field function.
Adding a new field without a function
Sometimes it could be useful to create a new field with a static input. We could say: This data is from a specific source.
One additional hint:
The field name does not accept a whitespace. You get an error if you try.
Add a comment
Some rules are large, and you can add a comment. This helps others to understand your code. Adding a comment is simple; it looks like this:
Lookup tables
Lookup tables are usually used when a vendor has status codes with a specific meaning. Let us assume we have a SAN array and we pull the status codes. Possible status codes are:
x001 -> everything_ok
x002 -> high_temperature
x003 -> backplane_failure
x004 -> drive_error
and so on. Some of those lists are huge, but if there is an error, the log only mentions x003 and nothing else. For a human, this is not useful. With Graylog, we can substitute the codes with messages. A detailed description is here: www.graylog.org/post/how-to-use-graylog-lookup-tables.
Hint:
If you follow the official guide, the code will not work with Graylog 4.09. The code is for an older version.
1. Create the lookup table as a csv file. On Linux, the file needs root rights. The standard path is /etc/graylog/lookup_status.csv, and therefore the command is:
nano /etc/graylog/lookup_status.csv
-> You can use a different editor like vi if you want to.
2. Fill in the code and pick a separator and a quote character. I picked , and ".
3. Go back to Graylog and open System -> Lookup Table -> Data Adapters -> Create data adapter -> csv file
4. Fill in the needed data:
5. We need to create a cache with ->Caches -> Create cache -> select Cache Type -> Node-local
6. Create the table with: Lookup tables -> Create lookup table
The preparations are done and we can write a matching rule with our lookup_status lookup table:
Graylog's rule interpreter has changed; if you use an older Graylog version, use the "official" code. For this example, I used the message field, but if you have a real status code, this will not work: the message field usually contains more data, and you have to extract the status_code first.
The result looks like this:
Working with RSYSLOG_SyslogProtocol23
Syslog has several formats, and Graylog works best with RSYSLOG_SyslogProtocol23. Graylog's test sender log_sender.py sends an RSYSLOG header, and we can analyze it as an example:
<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -
- <14> is the PRI (priority). The calculation is PRI = (Facility Value * 8) + Severity Value. One example: the facility of the mail system is 2 and the severity is 1 (alert): 2*8+1 = 17.
- The 1 after <PRI> is the version number
- This is the timestamp
- The PYTHON_TEST_SENDER is an example of the hostname. Others could look like this: mymachine.example.com
- The first - is a placeholder for the app name. On Linux it could be "su"
- The second - is a placeholder for the PROCID. If it is -, it is unknown.
- The third - is a placeholder for the MSGID (message id), and it could be: ID65
- The fourth - is a placeholder for STRUCTURED-DATA. Some messages might come with ID=12, Issue=overload and so on.
Each rsyslog header has the same structure, and we can use it to build a GROK pattern.
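Since PRI = facility * 8 + severity, decoding a PRI value back into its parts is a single divmod:

```python
def decode_pri(pri):
    """Split a syslog PRI value back into (facility, severity)."""
    return divmod(pri, 8)

print(decode_pri(14))  # facility 1 (user-level), severity 6 (informational)
print(decode_pri(17))  # facility 2 (mail), severity 1 (alert), as in the text
print(1 * 8 + 6)       # and back again: 14
```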
GROK patterns
A GROK pattern is nothing more than a group of regular expressions. The huge advantage is that you can reuse these expressions. One example:
You want to validate an IP address with a regular expression. You could write:
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
and you repeat it every time you want to validate an IP address. The code is huge, and therefore Graylog offers a better way with predefined GROK patterns. You can find them under System -> Grok Patterns
Graylog's regex is long, very long. With Grok, the code to extract a valid IPv4 address is relatively short:
The code still looks huge, but we can extract additional IP addresses in the same GROK pattern:
In my opinion, this code is hard to read. The ?: marks a non-capturing group: it groups an expression without creating a separate capture. Sometimes the : is mandatory, sometimes not.
Attention:
Grok function patterns do not follow Graylog's rule standard: a whitespace is a whitespace, not a \\s. If you have " as a separator, you should be able to escape the " but it does not work. It might be a bug.
Grok with rsyslog header
Now, we have the basics and can go back to rsyslog, our message was:
<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -
For <14> we do not have a grok pattern. We can build one, but we have to know the possible range. In case of a kernel message with emergency severity, the PRI is 0. The highest possible number is 23*8+7 = 191. We could validate this number, but it does not make sense; instead, we look for 1 to 3 digits between < and >. The code is simple:
<(\d{1,3})>
Luckily for us, grok works with standard regex; we do not have to double-escape the d.
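Because it is standard regex, the pattern can be verified directly with Python:

```python
import re

msg = "<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -"

# Grok uses ordinary (single-escaped) regex, so the pattern works as-is:
# 1 to 3 digits between < and >.
m = re.match(r"<(\d{1,3})>", msg)
print(m.group(1))
```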
The code looks like this:
Does it work? Yes!
The next part is the version number, and so far I have only seen 1; version 2 might come in the future. To make it stable, the code is: (\d). We could create a new Grok pattern, but we can also use a predefined one.
The next part is an ISO8601 timestamp, so we can reuse TIMESTAMP_ISO8601 or write a new grok pattern.
With the version and the date, the code looks like this:
The code gets longer and longer, but it is possible to stack different grok patterns in one single pattern. It looks like this:
We can shorten our code to:
The grok pattern includes a hard-coded host name. This is okay for a test, but in the real world we have different names. The same is true for the next four fields, and we can rewrite the LOG_SENDER grok pattern:
With this grok pattern (LOG_SENDER), we have the header for each RSYSLOG_SyslogProtocol23-formatted log, with one missing piece: the last msg field. This field is never the same, and each vendor puts different data in it. For this reason, it is best to start a new grok pattern: we can reuse LOG_SENDER, and for each data source we can write a new matching grok pattern.
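The stacking mechanism can be sketched in Python: each %{NAME:field} reference is expanded into the named sub-pattern before matching. The pattern bodies below are simplified stand-ins, not Graylog's exact definitions:

```python
import re

# Simplified stand-ins for predefined grok patterns (illustrative only).
GROK = {
    "INT": r"\d+",
    "HOSTNAME": r"\S+",
    "TIMESTAMP_ISO8601": r"\S+",
}

def expand(grok_expr):
    """Replace every %{NAME:field} reference with its pattern body."""
    def repl(m):
        name, _, field = m.group(1).partition(":")
        body = GROK[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return re.sub(r"%\{([^}]+)\}", repl, grok_expr)

# A stacked header pattern similar in spirit to LOG_SENDER from the text:
header = r"<%{INT:pri}>%{INT:version} %{TIMESTAMP_ISO8601:ts} %{HOSTNAME:host}"
regex = expand(header)

m = re.match(regex, "<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -")
print(m.group("pri"), m.group("host"))
```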
How to build a grok/regex pattern for a huge log in a short amount of time
The msg field could look like this:
1812 2018-02-10 18:00:11 "DP1-DE1_ProxySG" 16310 174.52.62.87 adam - - OBSERVED "Search Engines/Portals" http://www.szlb.net/ 200 TCP_NC_MISS GET application/x-javascript http libs.baidu.com 80 /jquery/1.7.2/jquery.min.js - js "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 Firefox/52.0" 192.168.1.2 95189 410 - - 0 "client" client_connector "Baidu Search" "-" 182.61.62.50 - - - - - none - - - - none - - ICAP_NOT_SCANNED - ICAP_NOT_SCANNED - - - - 9eef3983b1d826f3-00000000c3a345fd-000000005a7f331a
The data structure is:
#Fields: x-bluecoat-request-tenant-id date time x-bluecoat-appliance-name time-taken c-ip cs-userdn cs-auth-groups x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-data-leak-detected x-virus-id x-bluecoat-location-id x-bluecoat-location-name x-bluecoat-access-type x-bluecoat-application-name x-bluecoat-application-operation r-ip x-rs-certificate-validate-status x-rs-certificate-observed-errors x-cs-ocsp-error x-rs-ocsp-error x-rs-connection-negotiated-ssl-version x-rs-connection-negotiated-cipher x-rs-connection-negotiated-cipher-size x-rs-certificate-hostname x-rs-certificate-hostname-categories x-cs-connection-negotiated-ssl-version x-cs-connection-negotiated-cipher x-cs-connection-negotiated-cipher-size x-cs-certificate-subject cs-icap-status cs-icap-error-details rs-icap-status rs-icap-error-details x-cloud-rs x-bluecoat-placeholder cs(X-Requested-With) x-bluecoat-transaction-uuid
If we analyze the msg field, there are "easy" data fields in it and more complicated things like "Search Engines/Portals". This field has a whitespace in it, and the delimiter is a ". Other fields contain special characters, like the User-Agent field.
My goal is to semi-automate the way to create a pattern. I am doing the following steps:
- I take one test msg field and remove all whitespace characters inside fields delimited by "". Example: "Search Engines/Portals" becomes "SearchEngines/Portals". For this task, I have to remove the whitespace because it is the delimiter. Other data might have a different delimiter.
- I import the data into a table. The goal is to have each field in one cell. Use the settings to pick the right delimiter, here it is a whitespace but it could be something else
- Create a name schema. Graylog uses specific field names (schema.graylog.org/en/development/schema/entities.html), but if you work in a company, you might pick a different schema. Two examples: the field c-ip is actually source_ip and cs-method is http_method. This BlueCoat ProxySG log has a lot of fields without a matching schema for Graylog. If I see something similar, I might use that or vendor_ or if it is very special, I use bcp_ (for BlueCoatProxy).
- Do we want to use/extract all fields? If we extract fields, all the information goes into an array. If we import millions of logs and we extract all information, we might have a problem with RAM and disk space.
- Build or use a matching grok/regex for each field with the matching name schema.
- Build the master grok pattern (MSG_PROXYSG)
- Add the master grok pattern to the code.
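The steps above can be partly scripted. A minimal sketch that builds a master grok pattern from a field list; the name mapping and grok pattern choices here are illustrative (only three of the 50+ ProxySG fields are shown):

```python
# Each entry: (vendor field name, schema name we picked, grok pattern to use).
# The mapping is a hypothetical example, not a complete ProxySG schema.
fields = [
    ("x-bluecoat-request-tenant-id", "bcp_tenant_id", "INT"),
    ("c-ip", "source_ip", "IPV4"),
    ("cs-method", "http_method", "WORD"),
]

# Grok uses real whitespace (not \\s), so the references are joined with spaces.
master = " ".join(
    f"%{{{grok}:{schema_name}}}" for _, schema_name, grok in fields
)
print(master)
```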
Other functions
A very useful function is "contains" and it looks like this:
© 2021. This work is licensed under a CC BY-SA 4.0 license