CYBERSECURITY JOB HUNTING GUIDE

Graylog's pipeline rules

Author: Stefan Waldvogel

How to use regex, Grok, lookup tables with a pipeline rule

-under construction-
Overview
Graylog offers two ways to extract data from a specific field: extractors and pipelines. Extractors are quick to set up, but they cost CPU power. Pipelines are more efficient and the preferred way.

Requirements to test the code
- adjust the Message Processors Configuration so that the "Pipeline Processor" runs after the "Message Filter Chain"
- you need a log file, or some other kind of data
- you need to configure an input
- you need to send data to this input
- (in a real environment you might want to configure a stream to separate the data)
- add a new pipeline and add a new connection (select your stream; "All Messages" is the standard stream)
-> Manage rules
-> Create a rule.
Once the rule has a name, attach it to a pipeline.
-> After attaching it, you can simulate the code.

Helpful websites for regex
All commands are here:
docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
A regex builder:
regex101.com/
Graylog uses its own dialect. I will explain the differences later, but you can prebuild your pattern here.

The GUI
​The window for this feature looks like this:
Picture
  1. You find this window under System/Pipelines.
  2. Add a description that summarizes what the rule does.
  3. You write the code in this field. It is empty when you start.
  4. All special functions are listed here. These prebuilt functions are very useful and let you extract or manipulate specific data faster.
  5. Here is an example.

General rule structure
Graylog uses a very simple structure:

rule "name of the rule"
when
     true
then
     do something;
end

One example in detail
Goal for this example:
An incoming log uses the message field, and the log lines could look like this:
<14>1 1234 Random unwanted stuff
<14>1 5432 Different random unwanted stuff 
<14>1 8745 More different random unwanted stuff 

-> For whatever reason, we need the 4 digits that follow the static <14>1 (here 1234, 5432 and 8745).
The code looks like this:
Picture
  1. The first line gives the rule name. It allows spaces. If you reuse the name of an existing rule and save, you see a red error at the bottom. The error does not give you a hint about what is wrong.
  2. This is the condition, and it is boolean. If it is true, the rule is applied.
  3. Here, the then block has three statements. The first one, "let grep = regex.....", is a function call. I will explain it in detail later.
  4. With set_field I can create a new field, but there are many more options: Graylog offers dozens of prebuilt functions. For our task, we only need the first set_field. The second one is a helper field for developing the code; I will explain it later.
  5. Every rule has to close with end.

The regex pattern in detail:
Picture
  1. Each pattern in Graylog starts (and ends) with a ". If you do something wrong, you might see a different color, or the color changes inside your string. If you need a " inside your string, escape it with \".
  2. The ^ anchors the pattern at the beginning of the message. Here, it is optional; the code would run without it.
  3. <14>1 is a static value. The regex interpreter looks for this text, and if the message has it, the interpreter moves on.
  4. \\s means "look for a whitespace character". If there is a space between <14>1 and the next thing, the interpreter moves on. In regex101 you can type a real whitespace (or \s), but inside a Graylog rule string the backslash itself has to be escaped, so you write \\s.
  5. This expression is more complex and has multiple parts. The most important part is the round brackets (). Everything matched inside a pair of round brackets goes into a variable/array. Here, the name of the variable is grep. Later, you can read the content with grep["0"]. It is a 0 because the array starts at 0 and this is the first bracket in the expression.
    The next part is [\\d]. The square brackets hold this structure together. The \\d is a special character and matches one digit.
    The {4} is a quantifier for [\\d]. It asks for 4 digits in a row. -> If the message has four digits at this position, the value is copied into the grep variable. The variable looks like this: {0:dddd}. Each d represents a found digit.
  6. The \\s is again a whitespace. If a whitespace follows the 4 digits, the interpreter moves on.
    Hint: without this \\s, a message like <14>1 12345 would still be a valid extraction (it extracts 1234) because it has at least 4 digits at this position. If the message is <14>1 123, it is not valid and the interpreter stops. To check for exactly 4 digits, the whitespace has to follow the bracket directly:
    "^<14>1\\s([\\d]{4})\\s.*?(.*?)$"
    Here, the \\s after the bracket enforces the exact match: after 4 digits, a whitespace must appear. If not, the interpreter stops. The .*? that follows is non-greedy and matches as few characters as possible.
  7. This part is a second bracket, and it puts everything it matches into the variable grep, here grep["1"]. Together with the $ behind it, (.*?) takes everything until the end.
  8. The $ is the end anchor. The interpreter runs until the end of the message.
  9. The last " closes the pattern.
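The behavior described in this list can be tried outside Graylog. The following is a minimal sketch in Python, used here only as a stand-in for Graylog's Java-based regex engine. The function name extract and the sample messages are assumptions for illustration. Python raw strings need a single backslash where a Graylog rule string needs two, and the first element of the returned tuple corresponds to grep["0"].

```python
import re

# In a Graylog rule string this pattern is written with doubled backslashes:
# "^<14>1\\s([\\d]{4})\\s(.*?)$"
PATTERN = re.compile(r"^<14>1\s([\d]{4})\s(.*?)$")


def extract(message):
    """Return (four_digits, rest_of_message), or None if the message does not match."""
    match = PATTERN.search(message)
    return match.groups() if match else None


samples = [
    "<14>1 1234 Random unwanted stuff",
    "<14>1 12345 five digits, the whitespace check fails",
    "<14>1 123 only three digits",
]

for line in samples:
    print(line, "->", extract(line))
```

Only the first sample matches, because the whitespace after the bracket forces exactly 4 digits.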

​The improved code looks like this:
Picture
How to write a complex pattern?
Graylog's editor offers a little help, but it is very hard to find an error in a large regex expression. Writing working code for 50 fields is very painful, but luckily we can use external editors to prebuild the code. One option is:
regex101.com/
As mentioned before, Graylog's syntax is non-standard, but it follows simple rules. We can write the code with regex101 and then change a few small things. If we paste our Graylog code into regex101 unchanged, it does not work and looks like this:
Picture
We have "no match" but if we change some small things, it will work:
Picture
For this code, I only changed 2 things:
- replaced the \\s with a whitespace
- removed one \ before the d
Other patterns might have a " in the expression. regex101 does not need an escape, but Graylog needs a \".
According to Graylog's forum, Graylog uses a simplified regex command set, so not all possible structures and commands will work.
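The manual changes can also be scripted. This is a hedged sketch that assumes the only differences are the ones described here: doubled backslashes and escaped double quotes. The helper names graylog_to_regex and regex_to_graylog are made up for this example. The converted pattern keeps \s instead of a literal whitespace, which regex101 accepts as well.

```python
import re


def graylog_to_regex(graylog_pattern: str) -> str:
    # A backslash followed by a backslash or a double quote is one escaped
    # character, so drop the leading backslash and keep the character.
    return re.sub(r'\\([\\"])', r"\1", graylog_pattern)


def regex_to_graylog(regex_pattern: str) -> str:
    # Put a backslash in front of every backslash and every double quote.
    return re.sub(r'([\\"])', r"\\\1", regex_pattern)


graylog = r'.*?\"\\s*(.*?)\"\\sfrom\\s*([\\d\\.]+)'
plain = graylog_to_regex(graylog)

print("for regex101:", plain)
print("for Graylog: ", regex_to_graylog(plain))
```

Converting in one pass (instead of two separate replace calls) avoids touching the same backslash twice.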

A different more complex example:
Graylog pattern: .*?:\\s*(.*?)\\sfrom\\s*([\\d\\.]+).*?to\\s*([\\d\\.]+).*?on\\s*interface\\s*(.*?)$
matching code: %ASA-4-400013 IPS:2003 ICMP redirect from 10.4.1.2 to 10.2.1.1 on interface dmz
Picture
To get it to work on regex101, I:
- replaced every \\s with a whitespace
- removed one \ from every \\d and \\.
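To check which part of the message lands in which group, the same pattern can be run in Python (again only a stand-in for Graylog's regex function). The field names signature, source_ip, destination_ip and interface are hypothetical labels chosen for this sketch.

```python
import re

# Same pattern as above, written with single backslashes for a Python raw string.
ASA_PATTERN = re.compile(
    r".*?:\s*(.*?)\sfrom\s*([\d\.]+).*?to\s*([\d\.]+).*?on\s*interface\s*(.*?)$"
)

message = "%ASA-4-400013 IPS:2003 ICMP redirect from 10.4.1.2 to 10.2.1.1 on interface dmz"

match = ASA_PATTERN.search(message)
# The names are illustrative; in Graylog the groups are grep["0"] to grep["3"].
names = ["signature", "source_ip", "destination_ip", "interface"]
fields = dict(zip(names, match.groups()))

for name, value in fields.items():
    print(f"{name}: {value}")
```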

Example with a "
I am using the same test string with two differences. I want to extract a field that is enclosed in " characters, because many log files use this method.
​matching code: %ASA-4-400013 IPS"2003 ICMP redirect" from 10.4.1.2 to 10.2.1.1 on interface dmz
Picture
Graylog pattern: .*?\"\\s*(.*?)\"\\sfrom\\s*([\\d\\.]+).*?to\\s*([\\d\\.]+).*?on\\s*interface\\s*(.*?)$
Picture
If we want to include both " in the extracted field, we have to move both () to:
Picture
Graylog's simulator shows that we moved the () to the right position:
Picture
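The effect of the bracket position can be reproduced with a short Python sketch. The second pattern is my assumption of what "moving the ()" looks like, because the original screenshot is not part of this text.

```python
import re

message = '%ASA-4-400013 IPS"2003 ICMP redirect" from 10.4.1.2 to 10.2.1.1 on interface dmz'

# Brackets inside the quotes: the quotes are not part of the extracted field.
without_quotes = re.compile(r'.*?"\s*(.*?)"\sfrom')
# Brackets moved outside the quotes (assumed variant): the quotes are extracted, too.
with_quotes = re.compile(r'.*?("\s*.*?")\sfrom')

inner = without_quotes.search(message).group(1)
outer = with_quotes.search(message).group(1)

print(inner)
print(outer)
```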
A very useful construction is:
\\s*([\\d\\.]+).*?to
To understand it, we can modify the test message to:
Picture
The output looks like this:
Picture
The pattern does the following:
  1. \\s looks for a whitespace character.
  2. * repeats the whitespace: zero or more whitespace characters are accepted.
  3. The round brackets define the area to extract.
  4. [\\d\\.] is a character class. It matches a single character that is either a digit or a dot.
  5. + repeats the character class one or more times. The whole run of digits and dots goes into the variable grep, here grep["1"].
  6. .*? skips as little text as possible until the keyword to appears.

This is an IP address, but we can add a {n} to the code and it looks like this:
Picture
The result works, but only because 10.4.1.4 has 8 characters. If the IP has fewer characters, the interpreter will not detect it; if it has more than 8, the interpreter takes only the first 8. An IP like 100.100.10.3 looks like this:
Picture
How can we improve this code for an IP address?
The {} accepts more than one argument, so we can use a range, too. An IP address can be anything from 1.1.1.1 to 255.255.255.255, so it has between 7 and 15 characters. The important question is: what is the data source, and can we trust the data? Regex can validate an IP address, but the expression gets very long. We want to extract data from a log file, not sanitize it. The next code shows two options to pick an IP address:
Picture
  1. This is a very short and efficient version.
  2. This is a much longer version, but it checks for 4 groups separated by dots.
  3. If you want to validate the IP, you could use something like this: (([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
    -> We have the when field. There, we could use this regex to check whether we have a real IP. If yes, we could do something else with this IP.
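The trade-off between the three options can be sketched in Python. The first two patterns are assumptions about what a "short" and a "grouped" version could look like, since the original screenshot is missing; the third is the validating pattern from item 3. All names are illustrative.

```python
import re

# Option 1 (assumed): any run of 7 to 15 digits and dots. Short, but no validation.
SHORT = re.compile(r"[\d\.]{7,15}")
# Option 2 (assumed): four groups of 1 to 3 digits separated by dots.
GROUPED = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
# Option 3: the validating pattern from the text; each octet must be 0 to 255.
OCTET = r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
VALID = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)


def is_valid_ipv4(text):
    # fullmatch is needed: a plain search would accept "56.1.1.1" inside "256.1.1.1".
    return VALID.fullmatch(text) is not None


print(SHORT.search("redirect from 100.100.10.3 to").group())
print(GROUPED.fullmatch("999.1.1.1") is not None)  # grouped, but not validated
print(is_valid_ipv4("255.255.255.255"), is_valid_ipv4("256.1.1.1"))
```

The short versions are fine as long as the data source can be trusted; only the third one rejects values such as 999.1.1.1.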

Extract fields after a specific string
Many log formats start with a specific string and it could look like this:
​headerinfo - - - - data1 unwanteddata
In this example, we want one field after - - - -. With regex101 we can write our first code.
Picture
Now, we can substitute all whitespace characters with \\s:
Picture
The result:
Picture
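Since the screenshots are not part of this text, here is a hedged Python sketch of one pattern that does this job. The pattern itself is an assumption, not necessarily the one shown in the pictures, and in a Graylog rule string every backslash would be doubled.

```python
import re

# Four dashes separated by whitespace, then capture the next run of non-whitespace.
AFTER_HEADER = re.compile(r"-\s-\s-\s-\s(\S+)")

line = "headerinfo - - - - data1 unwanteddata"
match = AFTER_HEADER.search(line)
first_field = match.group(1) if match else None

print(first_field)
```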
Extracting multiple similar fields
Sometimes logs contain fields without special characters. If we have such a log file, the pattern looks very simple. One example:
data1 data2 data3 "datawithbrackets" data5
Picture
This way is really cool, but Graylog's regex language does not support this pattern. The reason might be the array: this construct creates a weird array with null fields in it. Technically, it looks like this:
{0:data1, 1:NULL, 2:data2....}
Possible reasons:
  • Graylog's regex cannot insert NULL fields into an array
  • this kind of code is not supported
  • Graylog's regex needs one matching expression for each group
Picture
Extract a date with a header
Extracting a date is a very common task. Graylog has a built-in function for this, but we will do it with regex first. On regex101, it looks like this:
Picture
First, we have header data that ends with - - - -, followed by a date and other data. To get the pattern into Graylog, substitute every whitespace with \\s and add a second \ before each \d.
Right now, writing code in regex101 might not look very smart, but imagine you want to extract 50 fields. With regex101, you get a lot more feedback and you can change the code very quickly.
Graylog has functions to modify time and date. Sometimes you get a date like 2021-12-28, but a database requires a full timestamp. For this, we can use Graylog's parse_date function.
Picture
The code for the output:
Picture
The result looks like this:
Picture
With this function, you can reformat time and date. This is useful if we have multiple different log sources but want to work with one standard in our database. If we have something like this:
20180227
the function parse_date can transform it into Graylog's standard time. The code for this example looks like this:
Picture
Picture
With the pattern, we can transform any date/time into the format we want. Symantec's ProxySG logs use:
yyyy-MM-dd HH:mm:ss
A syslog could look like this:
May 12 14:30:00 
and the matching pattern is: MMM dd HH:mm:ss
Sadly, syslog does not carry a year, and Graylog takes 2000 as the default. Technically, we could take the timestamp field, extract the year, and add it to the syslog date with the function concat.
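The idea of parsing with a pattern, and of adding the missing year before parsing, can be illustrated in Python. This is only an analogy: Python's strptime uses directives such as %Y%m%d, while the Graylog patterns above use letters such as yyyyMMdd. The function names are invented for this sketch, and the year is passed in by hand instead of being read from a timestamp field.

```python
from datetime import datetime


def parse_compact_date(text):
    # Corresponds to the pattern yyyyMMdd used for a value like 20180227.
    return datetime.strptime(text, "%Y%m%d")


def parse_syslog_time(text, year):
    # Syslog has no year, so prepend one before parsing
    # (the same idea as using concat in a pipeline rule).
    return datetime.strptime(f"{year} {text.strip()}", "%Y %b %d %H:%M:%S")


print(parse_compact_date("20180227").isoformat())
print(parse_syslog_time("May 12 14:30:00 ", 2021).isoformat())
```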

Key value function
The function key_value is one of the most useful functions for logs and other files. A file or log could look like this:
car=Kia|color=perl_blue|type=sedan
or
IP=123.32.12.1|hostname="server 1"|location=NY
​The code is very simple:
Picture
The result:
Picture
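To make clear what such a key/value split does, here is a simplified Python stand-in. It is not Graylog's implementation: the function name and parameters are assumptions, and it does not handle a | or = inside a quoted value.

```python
def parse_key_values(text, pair_sep="|", kv_sep="="):
    """Split 'k1=v1|k2=v2' into a dict and strip surrounding double quotes."""
    fields = {}
    for pair in text.split(pair_sep):
        key, sep, value = pair.partition(kv_sep)
        if not sep:
            continue  # skip fragments without a key/value separator
        fields[key.strip()] = value.strip().strip('"')
    return fields


car = parse_key_values("car=Kia|color=perl_blue|type=sedan")
host = parse_key_values('IP=123.32.12.1|hostname="server 1"|location=NY')

print(car)
print(host)
```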
Rename fields
The last task gave us some fields, but maybe IP is too generic. Is it a source IP or a destination IP? We can change the field name with the rename_field function.
Picture
Picture
Adding a new field without a function
Sometimes it could be useful to create a new field with a static input. We could say: This data is from a specific source.
Picture
One additional hint:
The field name does not accept a whitespace. You get an error if you try.

Add a comment
Some rules are large, and a comment helps others to understand your code. Adding a comment is simple; it looks like this:
Picture
Lookup tables
Lookup tables are usually used if a vendor has status codes with a specific meaning. Let us assume we have a SAN array and we pull the status codes. Possible status codes are:
x001 -> everything_ok
x002 -> high_temperature
x003 -> backplane_failure
x004 -> drive_error

and so on. Some of those lists are huge, but if there is an error, the log only mentions x003 and nothing else. For a human, this is not useful. With Graylog, we can translate codes into messages. A detailed description is here: www.graylog.org/post/how-to-use-graylog-lookup-tables.
Hint:
If you follow the official guide, the code will not work with Graylog 4.09. The code is for an older version.
​
1. Create the lookup table as a csv file. On Linux, you need root rights to write the file. The path used here is /etc/graylog/lookup_status.csv, and therefore the command is:
nano /etc/graylog/lookup_status.csv
-> You can use a different editor like vi if you want to.
Picture
2. Fill in the code and pick a separator and a quote character. I picked , and ". 
Picture
3. Go back to Graylog and open System -> Lookup Table -> Data Adapters -> Create data adapter -> csv file
4. Fill in the needed data:
Picture
5. We need to create a cache with ->Caches -> Create cache -> select Cache Type -> Node-local
Picture
6. Create the table with: Lookup tables -> Create lookup table
Picture
The preparations are done and we can write a matching rule with our lookup_status lookup table:
Picture
Graylog's rule interpreter has changed. If you use an older Graylog version, use the "official" code. For this example, I used the message field, but with a real log this will not work: the message field usually contains more data, and you have to extract the status_code first.
The result looks like this:
Picture
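What the CSV data adapter and the lookup do together can be sketched in a few lines of Python. The column names status_code and status_message, the default text, and the in-memory CSV are assumptions for this example; the separator and quote character match the ones picked above.

```python
import csv
import io

# Stand-in for the csv file; the column names are hypothetical.
CSV_TEXT = '''"status_code","status_message"
"x001","everything_ok"
"x003","backplane_failure"
"x004","drive_error"
'''


def load_lookup(csv_text, key_column, value_column):
    """Read the CSV into a dict that maps a status code to its message."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    return {row[key_column]: row[value_column] for row in reader}


table = load_lookup(CSV_TEXT, "status_code", "status_message")


def lookup_status(code, default="unknown_status"):
    # Return a default instead of failing when the code is not in the table.
    return table.get(code, default)


print(lookup_status("x003"))
print(lookup_status("x999"))
```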
Working with RSYSLOG_SyslogProtocol23
Syslog comes in several formats, and Graylog works best with RSYSLOG_SyslogProtocol23. Graylog's test sender log_sender.py sends an RSYSLOG header, and we can analyze it as an example:
<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -
  1. <14> is the PRI (priority). The calculation is PRI = (Facility Value * 8) + Severity Value. One example: the facility of the mail system is 2 and the severity is 1 (alert). 2*8+1=17
  2. The 1 after <PRI> is the version number.
  3. This is the timestamp.
  4. PYTHON_TEST_SENDER is an example for the hostname. Others could look like this: mymachine.example.com
  5. The first - is a placeholder for the app name. On Linux it could be "su".
  6. The second - is a placeholder for the PROCID. If it is -, it is unknown.
  7. The third - is a placeholder for the MSGID (message id), and it could be: ID65
  8. The fourth - is a placeholder for STRUCTURED-DATA. Some messages might come with ID=12, Issue=overload and so on.
The real message comes next, and it is the MSG (message) field. It is more or less unstructured data.
Picture
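The PRI formula works in both directions. As a small worked example in Python (the function names are made up for this sketch): the <14> from the test sender decodes to facility 1 (user-level messages) and severity 6 (informational).

```python
def encode_pri(facility, severity):
    """PRI = facility * 8 + severity (facility 0..23, severity 0..7)."""
    if not 0 <= facility <= 23 or not 0 <= severity <= 7:
        raise ValueError("facility must be 0..23 and severity 0..7")
    return facility * 8 + severity


def decode_pri(pri):
    """Split a PRI value back into (facility, severity)."""
    if not 0 <= pri <= 191:
        raise ValueError("PRI must be 0..191")
    return divmod(pri, 8)


print(encode_pri(2, 1))   # mail system, alert
print(decode_pri(14))     # the <14> of the test sender
```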
Each rsyslog header has the same structure, and we can use it to build a GROK pattern.

GROK patterns
A GROK pattern is nothing more than a group of regular expressions. The huge advantage is that you can reuse these expressions. One example:
You want to validate an IP address with a regular expression. You could write:
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
and repeat it every time you want to validate an IP address. The code is huge, and therefore Graylog offers a better way with predefined GROK patterns. You can find them under System -> Grok Patterns
Picture
Graylog's regex is long, very long. With Grok, the code to extract a valid IPv4 address is relatively short:
Picture
The code still looks huge, but we can extract more IP addresses in the same GROK pattern:
Picture
In my opinion, this code is weird, and I do not fully understand how it works. In standard regex, (?:...) is a non-capturing group: it groups everything enclosed without storing it. Sometimes the : is mandatory, sometimes not.
Attention:
Grok function patterns do not follow Graylog's rule standard: a whitespace is a whitespace and not a \\s. If you have " as a separator, you can escape the ", but it does not work. It might be a bug.
Picture
Grok with rsyslog header
Now that we have the basics, we can go back to rsyslog. Our message was:
<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - -

For <14> we do not have a grok pattern. We can build one, but we have to know the possible range. In the case of a kernel message with an emergency, the PRI is 0. The highest possible number is 23*8+7=191. We could validate this number, but it does not make sense. Instead, we look for 1 to 3 digits between <>. The code is simple:

<(\d{1,3})> 
Luckily for us, grok works with standard regex, so we do not have to double-escape the d.
Picture
The code looks like this:
Picture
Does it work? Yes!
Picture
Next is the version number. I found only the 1; version 2 might come in the future. To keep it stable, the code is: (\d). We can create a new grok pattern, or we can use a predefined one.
Picture
The next part is an ISO8601 timestamp, therefore we can reuse TIMESTAMP_ISO8601 or write a new grok pattern.
Picture
With the version and the date, the code looks like this:
Picture
The code gets longer and longer, but it is possible to stack different grok patterns in one single pattern. It looks like this:
Picture
We can shorten our code to:
Picture
The grok pattern includes a hard-coded host name. This is okay for a test, but in the real world we have different names. This is also true for the next four fields, so we can rewrite the LOG_SENDER grok pattern:
Picture
With this grok pattern (LOG_SENDER), we have the header for each RSYSLOG_SyslogProtocol23-formatted log with one thing missing: the final msg field. This field is never the same, and each vendor puts different data in it. For this reason, it is best to start a new grok pattern. We can reuse LOG_SENDER, and for each data source we can write a new matching grok pattern.
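As a rough equivalent of such a stacked header pattern, the following Python sketch uses one regex with named groups. It is not the LOG_SENDER grok pattern itself: the group names are assumptions, the timestamp is taken as any run of non-whitespace instead of a validated ISO8601 value, and STRUCTURED-DATA is assumed to contain no whitespace, which is a simplification.

```python
import re

HEADER = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d)\s"
    r"(?P<timestamp>\S+)\s(?P<hostname>\S+)\s"
    r"(?P<app_name>\S+)\s(?P<procid>\S+)\s(?P<msgid>\S+)\s"
    r"(?P<structured_data>\S+)\s?(?P<msg>.*)$"
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER.match(line)
    return match.groupdict() if match else None


line = "<14>1 2021-07-28T23:45:32:31.169035Z PYTHON_TEST_SENDER - - - - test message"
fields = parse_header(line)

for name, value in fields.items():
    print(f"{name}: {value}")
```

Everything after the header ends up in msg, which is where a vendor-specific pattern would take over.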

How to build a grok/regex pattern for a huge log in a short amount of time
The msg field could look like this:

1812 2018-02-10 18:00:11 "DP1-DE1_ProxySG" 16310 174.52.62.87 adam - - OBSERVED "Search Engines/Portals" http://www.szlb.net/ 200 TCP_NC_MISS GET application/x-javascript http libs.baidu.com 80 /jquery/1.7.2/jquery.min.js - js "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 Firefox/52.0" 192.168.1.2 95189 410 - - 0 "client" client_connector "Baidu Search" "-" 182.61.62.50 - - - - - none - - - - none - - ICAP_NOT_SCANNED - ICAP_NOT_SCANNED - - - - 9eef3983b1d826f3-00000000c3a345fd-000000005a7f331a

The data structure is:
#Fields: x-bluecoat-request-tenant-id date time x-bluecoat-appliance-name time-taken c-ip cs-userdn cs-auth-groups x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-data-leak-detected x-virus-id x-bluecoat-location-id x-bluecoat-location-name x-bluecoat-access-type x-bluecoat-application-name x-bluecoat-application-operation r-ip x-rs-certificate-validate-status x-rs-certificate-observed-errors x-cs-ocsp-error x-rs-ocsp-error x-rs-connection-negotiated-ssl-version x-rs-connection-negotiated-cipher x-rs-connection-negotiated-cipher-size x-rs-certificate-hostname x-rs-certificate-hostname-categories x-cs-connection-negotiated-ssl-version x-cs-connection-negotiated-cipher x-cs-connection-negotiated-cipher-size x-cs-certificate-subject cs-icap-status cs-icap-error-details rs-icap-status rs-icap-error-details x-cloud-rs x-bluecoat-placeholder cs(X-Requested-With) x-bluecoat-transaction-uuid

If we analyze the msg field, there are "easy" data fields in it and more complicated things like "Search Engines/Portals". This field has a whitespace in it and is delimited by ". Other fields contain special characters, like the User-Agent field.

My goal is to semi-automate the way to create a pattern. I am doing the following steps:
  1. I take one test msg field and remove all whitespace characters inside every field that is enclosed in "". Example: "Search Engines/Portals" becomes "SearchEngines/Portals". For this task, I have to remove the whitespace because it is the delimiter. Other data might have a different delimiter.
  2. I import the data into a table. The goal is to have each field in one cell. Use the settings to pick the right delimiter. Here it is a whitespace, but it could be something else.
  3. Create a naming schema. Graylog uses specific field names (schema.graylog.org/en/development/schema/entities.html), but if you work in a company, you might pick a different schema. Two examples: the field c-ip is actually source_ip, and cs-method is http_method. This BlueCoat ProxySG log has a lot of fields without a matching name in Graylog's schema. If I see something similar, I might use that or vendor_, or if it is very special, I use bcp_ (for BlueCoatProxy).
  4. Do we want to use/extract all fields? If we extract fields, all the information goes into an array. If we import millions of logs and extract all information, we might have a problem with RAM and disk space.
  5. Build or use a matching grok/regex for each field with the matching name from the schema.
  6. Build the master grok pattern (MSG_PROXYSG).
  7. Add the master grok pattern to the code.
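Steps 1 to 3 can be semi-automated with a small script. The sketch below swaps in a different technique for step 1: instead of removing the whitespace inside quoted fields by hand, Python's shlex module splits on whitespace and respects the quotes. To keep it short, it only uses the first 15 fields of the sample and of the #Fields line. The two renames come from step 3; the function name is invented for this example.

```python
import shlex

# First 15 names from the "#Fields:" line above.
FIELD_NAMES = (
    "x-bluecoat-request-tenant-id date time x-bluecoat-appliance-name "
    "time-taken c-ip cs-userdn cs-auth-groups x-exception-id sc-filter-result "
    "cs-categories cs(Referer) sc-status s-action cs-method"
).split()

# First 15 values from the sample msg field above.
SAMPLE = (
    '1812 2018-02-10 18:00:11 "DP1-DE1_ProxySG" 16310 174.52.62.87 adam - - '
    'OBSERVED "Search Engines/Portals" http://www.szlb.net/ 200 TCP_NC_MISS GET'
)

# Naming schema from step 3; fields without an entry keep their vendor name.
RENAMES = {"c-ip": "source_ip", "cs-method": "http_method"}


def parse_proxysg(msg):
    values = shlex.split(msg)  # splits on whitespace, keeps quoted fields together
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return {RENAMES.get(name, name): value for name, value in zip(FIELD_NAMES, values)}


record = parse_proxysg(SAMPLE)
for name, value in record.items():
    print(f"{name}: {value}")
```

The printed name/value pairs make it easy to see which fields need a special pattern before the master grok pattern is written.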

​
Other functions
A very useful function is "contains" and it looks like this:
Picture
© 2021. This work is licensed under a CC BY-SA 4.0 license​