Transforming data

Until now, our commands have returned the original events with modifications to their fields. Commands can also transform data, much like the built-in functions top and stats. Let's write a function to count the words in our events. You can find this example in ImplementingSplunkExtendingExamples/bin/countwords.py:

import splunk.Intersplunk as si 
import re 
import operator 
from collections import defaultdict 
#create a class that does the actual work 
class WordCounter: 
word_counts = defaultdict(int) 
unique_word_counts = defaultdict(int) 
rowcount = 0 
casesensitive = False 
mincount = 50 
minwordlength = 3 
def process_event(self, input): 
self.rowcount += 1 
words_in_event = re.findall('W*([a-zA-Z]+)W*', input) 
unique_words_in_event = set() 
for word in words_in_event: 
if len(word) < self.minwordlength: 
continue # skip this word, it's too short 
if not self.casesensitive: 
word = word.lower() 
self.word_counts[word] += 1 
unique_words_in_event.add(word) 
for word in unique_words_in_event: 
self.unique_word_counts[word] += 1 
def build_sorted_counts(self): 
#create an array of tuples, 
#ordered by the count for each word 
sorted_counts = sorted(self.word_counts.iteritems(), 
key=operator.itemgetter(1)) 
#reverse it 
sorted_counts.reverse() 
return sorted_counts 
def build_rows(self): 
#build our results, which must be a list of dict 
count_rows = [] 
for word, count in self.build_sorted_counts(): 
if self.mincount < 1 or count >= self.mincount: 
unique = self.unique_word_counts.get(word, 0) 
percent = round(100.0 * unique / self.rowcount, 2) 
newrow = {'word': word, 
'count': str(count), 
'Events with word': str(unique), 
'Event count': str(self.rowcount), 
'Percent of events with word': 
str(percent)} 
count_rows.append(newrow) 
return count_rows 
#a helper method that doesn't really belong in the class 
#return an integer from an option, or raise useful Exception 
def getInt(options, field, default): 
try: 
return int(options.get(field, default)) 
except Exception, e: 
#raise a user friendly exception 
raise Exception("%s must be an integer" % field) 
#our main method, which reads the options, creates a WordCounter 
#instance, and loops over the results 
if __name__ == '__main__': 
try: 
#get our results 
results, dummyresults, settings = si.getOrganizedResults() 
keywords, options = si.getKeywordsAndOptions() 
word_counter = WordCounter() 
word_counter.mincount = getInt(options, 'mincount', 50) 
word_counter.minwordlength = getInt(options, 
'minwordlength', 3) 
#determine whether we should be case sensitive 
casesensitive = options.get('casesensitive', False) 
if casesensitive: 
casesensitive = (casesensitive.lower().strip() in 
['t', 'true', '1', 'y', 'yes']) 
word_counter.casesensitive = casesensitive 
#loop through the original results 
for r in results: 
word_counter.process_event(r['_raw']) 
output = word_counter.build_rows() 
si.outputResults(output) 
#catch the exception and show the error to the user 
except Exception, e: 
import traceback 
stack = traceback.format_exc() 
si.generateErrorResults("Error '%s'. %s" % (e, stack))

This is a larger script, but hopefully, what is happening is clear to you. Note a few new things in this example:

Most of the logic is in the class definition. This provides a better separation of Splunk-specific logic and business logic.
Testing for __main__ is done as is the Python way.
Exception handling is also done.
This is a nicer exception for failed parsing of integer arguments.
Here, there are field names with spaces in them.

Our entry in commands.conf does not allow streaming and does not retain events:

[countwords] 
filename = countwords.py 
retainsevents = false 
streaming = false

We can then use our command as follows:

* | countwords

This will give us a table, as shown in the following screenshot:

With my test data, this produced 132 rows, representing 132 unique words at least three characters long in my not-so-random dataset. count represents how many times each word occurred overall, while Events with word represents how many events contained the word at all.

Note the value 50000 in the Event count column. Even though my query found more than 300,000 events, only 50,000 events made their way to the command. You can increase this limit by increasing maxresultrows in limits.conf, but be careful! This limit is for your protection.

Try out our options as follows:

* | head 1000 | countwords casesensitive=true mincount=250 minwordlength=0

This query produces the following output:

Note that we now see one- and two-letter words, with entries for both T and t, and our results stop when count drops below our value for mincount.

Just for completeness, to accomplish this command using built-in commands, you could write something similar to the following code:

* | rex max_match=1000 "W*(?<word>[a-zA-Z]+)W*" 
| eval id=1 | accum id | fields word id 
| eventstats count 
| mvexpand word 
| eval word=lower(word) 
| stats max(count) as event_count dc(id) as events_with_word count as word_count by word 
| sort -events_with_word 
| eval percent_events_containing = round(events_with_word/event_count*100.0,2) 
| rename word_count as count 
events_with_word as "Events with word" 
event_count as "Event count" 
percent_events_containing as "Percent of events with word" 
| table count "Events with word" word 
"Event count" "Percent of events with word"

There is probably a more efficient way to do this work using built-in commands, but this is what comes to mind initially.