Sometimes, a process needs to be long-running, for instance, if it is polling an external source, such as a database. A simple example might be as follows:
import time
import random
import sys

for i in range(1, 1000):
    print "%s Hello." % time.strftime('%Y-%m-%dT%H:%M:%S')
    # make sure python actually sends the output
    sys.stdout.flush()
    time.sleep(random.randint(1, 5))
This script will run for somewhere between 1,000 and 5,000 seconds and then exit.
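Before handing the script to Splunk, it can be worth running it by hand to confirm that it emits timestamped lines and flushes them promptly. One way to do this, assuming the script has been saved as bin/long_running.py inside an app named your_app (the app name is just a placeholder), is to run it with Splunk's bundled Python via splunk cmd:
$SPLUNK_HOME/bin/splunk cmd python $SPLUNK_HOME/etc/apps/your_app/bin/long_running.py
You should see a new Hello. line every few seconds; press Ctrl + C to stop it.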
Since this is a long-running script, our choices are either to treat each line as an event as we did in the Capturing script output with no date section, or, if we know that there is a date to use, configure the input like a regular log file. In this case, we can see that there is always a date, so we will rely on that. The output is, unsurprisingly, as follows:
2012-10-30T20:13:29 Hello.
2012-10-30T20:13:33 Hello.
2012-10-30T20:13:36 Hello.
The inputs.conf file should contain something similar to the following code:
[script://./bin/long_running.py]
interval = 1
sourcetype = long_running
With interval = 1, Splunk will try to launch the script every second but will only run one copy of the script at a time.
The props.conf file should then contain something similar to the following code:
[long_running]
TIME_FORMAT = %Y-%m-%dT%H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 21
BREAK_ONLY_BEFORE = ^\d{4}-\d{1,2}-\d{1,2}T\d{1,2}:
With this configuration in place, Splunk will keep the long-running process alive and index each timestamped line as its own event; the script itself can then do whatever work is appropriate.
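Once events start arriving, a quick search such as the following confirms that line breaking and timestamp extraction are behaving as expected (scripted input data lands in the default index unless you route it elsewhere):
sourcetype=long_running | head 5
Each result should be a single Hello. line whose _time matches the timestamp in the raw text.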
Even though it is convenient to have Splunk execute scripts for you and capture the output, if the information you are capturing is vital, it may be safer to simply schedule the script with cron, direct its output to a file, and point Splunk at that file. This approach lets you use the file in other ways, captures both standard output and standard error, and keeps collecting data even when Splunk is down. The disadvantage, however, is that you have to clean up those log files yourself.
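As a rough sketch of that alternative, the crontab entry and monitor stanza below show one way to wire it up; the interpreter, script, and log paths are only examples, and flock is used so cron never starts a second copy while one is still running (mirroring the one-copy-at-a-time behavior of the scripted input):
# restart the collector whenever it is not already running, sending stdout and stderr to one log
* * * * * /usr/bin/flock -n /tmp/long_running.lock /usr/bin/python /opt/scripts/long_running.py >> /var/log/long_running.log 2>&1
Then, in inputs.conf, monitor the resulting file instead of running the script:
[monitor:///var/log/long_running.log]
sourcetype = long_running
Because the sourcetype is still long_running, the props.conf stanza shown earlier applies unchanged; the one remaining chore is rotating or trimming /var/log/long_running.log yourself.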