Advent of code in SPL - Day 5
- Gabriel Vasseur
Note: I brought these challenges up in the puzzle channel on the Splunk Community Slack a few weeks ago, and I'm delighted to see that ITWhisperer has now started his own series about it! His day 1 is here. I encourage you to check out his solutions, as it's great to see different ways of thinking!
Day 5 is here.
For this one, my thoughts went to lookups. The idea would be to create a lookup of fresh ingredients. Then, for each ingredient, all we have to do is check whether it's in the lookup. I thought it was going to be extremely easy... We'll be building a few searches at the same time for this one, so you'll need to keep a few tabs open. I've added a label in the top right corner of the SPL (e.g. "lookup-builder") so you can keep track of which search we're working on.
Let's start with the first sample data range, 3-5. We can use mvexpand and makecontinuous to enumerate all of its IDs:
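Something like this should do it (a minimal sketch; rex is just one way to pull out the two boundaries):

| makeresults
| eval range="3-5"
| rex field=range "(?<start>\d+)-(?<end>\d+)"
| eval id=mvappend(tonumber(start), tonumber(end))    ``` keep just the two endpoints for now ```
| mvexpand id
| makecontinuous id    ``` fill in the gap: 3, 4, 5 ```
| table id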
And we can expand this to the whole sample data with map and save it to a lookup:
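Here's a sketch of that "lookup-builder" search. The sample ranges other than 3-5 are made up for illustration; map runs the enumeration subsearch once per range, substituting $start$ and $end$:

| makeresults
| eval range=split("3-5 10-14 16-20 12-18", " ")
| mvexpand range
| rex field=range "(?<start>\d+)-(?<end>\d+)"
| map maxsearches=50 search="| makeresults | eval id=mvrange($start$, $end$ + 1) | mvexpand id"
| dedup id    ``` overlapping ranges would produce duplicate ids ```
| table id
| outputlookup fresh_sample_ids.csv

Note that mvrange excludes its upper bound, hence the + 1. Using mvrange here is a bit slicker than the mvexpand/makecontinuous trick above, but either works.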
Then we can take the sample data ingredient list:
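Again, the values here are placeholders rather than the real sample list:

| makeresults
| eval id=split("1 5 8 11 17 32", " ")
| mvexpand id
| table id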
And lookup each ingredient to see if it's fresh:
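If the id is in the lookup we get a match back, otherwise nothing, which we can turn into a yes/no field (sticking with the fresh_sample_ids.csv name from before):

| makeresults
| eval id=split("1 5 8 11 17 32", " ")
| mvexpand id
| lookup fresh_sample_ids.csv id OUTPUT id AS fresh
| eval fresh=if(isnotnull(fresh), "yes", "no")
| table id fresh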
And we can just count the number of fresh ingredients:
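Tacked onto the end of the previous search:

| stats count(eval(fresh="yes")) AS fresh_count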
Easy, and we're leveraging the features of Splunk.
The trouble is again in the sheer scale of the full data. But first we have to separate the list of ID ranges from the list of ingredient IDs. I'll use streamstats to figure out where the blank line is, and eventstats to propagate this information to all the events:
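A sketch, assuming the puzzle input has been loaded one line per event into a field called line:

| streamstats count AS row
| eval blank_row=if(line="", row, null())    ``` only the blank line gets a value here ```
| eventstats min(blank_row) AS blank_row    ``` propagate it to every event ```
| eval section=if(row<blank_row, "range", "ingredient")
| where line!=""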
Now we can look at the sheer scale:
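Continuing from the search above:

| where section="range"
| rex field=line "(?<start>\d+)-(?<end>\d+)"
| eval size=tonumber(end) - tonumber(start) + 1
| stats count AS ranges, max(size) AS biggest_range, max(eval(tonumber(end))) AS biggest_id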
In our data, we have 174 ranges, the biggest range has 8,407,584,579,256 elements (8 trillion!) and the biggest ID is 559 trillion! I think we're going to hit a limit or two, not to mention the sheer disk space needed for our lookup...
So we need to pivot. Because we started with lookups in mind, I want to continue in this direction. This is not where you would go if you tackled this challenge in a proper programming language, but it's definitely a valid way of thinking with Splunk. The question is: could we use a timed lookup?
Timed lookup
A timed lookup is a lookup that has a _time field on top of the fields you want to look up. Say you have a feed of DHCP data that shows when an IP address is assigned to a hostname:
07:00 10.0.0.1 assigned to GabsLaptop
08:00 10.0.0.2 assigned to BossLaptop
09:00 10.0.0.3 assigned to GabsLaptop
09:30 10.0.0.1 assigned to LabLaptop
If you have a timed lookup with this information, you can now look up the IP 10.0.0.1 and, depending on the content of the _time field, it will return either nothing, GabsLaptop, or LabLaptop. This is very different from a lookup that only keeps track of which host has which IP according to the latest data.
Can we abuse this system in our scenario? ID could work as the _time field. We just need to know at what IDs ingredients toggle between fresh and spoiled.
Overlapping ranges are going to be a problem though, so let's ignore the last range for now.
All we need to do is split each row into a start and an end (it'll be a much smaller lookup):
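A sketch of that "timifier" search, using the same made-up sample ranges minus the overlapping last one:

| makeresults
| eval range=split("3-5 10-14 16-20", " ")
| mvexpand range
| rex field=range "(?<start>\d+)-(?<end>\d+)"
| eval _time=mvappend(start, end)
| mvexpand _time    ``` one row per boundary ```
| eval status=if(_time=start, "fresh", "spoiled")
| eval _time=if(status="spoiled", tonumber(_time) + 1, tonumber(_time))
| sort 0 + _time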
For the end boundary, we're adding a little bit to _time because the last ID in the range is still considered fresh.
We'll add a dummy field so that we have something to look up, and save it:
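Continuing the timifier (the dummy value and the file name are arbitrary choices):

| eval dummy=1
| table _time dummy status
| outputlookup timed_freshness.csv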
Now we'll add a new lookup definition linked to it:
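This is done through the UI, but the equivalent transforms.conf stanza would look something like this (the definition name is my choice, and max_offset_secs is shown at its default):

[freshness_timed]
filename = timed_freshness.csv
time_field = _time
time_format = %s
max_offset_secs = 2000000000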
Now let's look up our sample data. To do so, we need a _time field (that's id) and of course our dummy field:
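Sticking with the made-up sample ingredient list:

| makeresults
| eval id=split("1 5 8 11 17 32", " ")
| mvexpand id
| eval _time=tonumber(id), dummy=1
| lookup freshness_timed dummy OUTPUT status
| table id status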
That's promising!
Now we have two obstacles:
limits?
overlapping ranges?
Exploring limits
If we take the highest ID, this for instance doesn't look good:
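For instance, treating an ID of that size as an epoch timestamp puts us somewhere around seventeen million years into the future (the value below is hypothetical, sized like our biggest ID):

| makeresults
| eval _time=559407584579255
| eval human=strftime(_time, "%F %T")    ``` if strftime copes at all ```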
But it might still work. Let's take the highest range:
And look up a few test values:
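First, rebuild the timed lookup from just that one range (the boundaries are made up, but sized to match the stats above):

| makeresults
| eval start=551000000000000, end=559407584579255
| eval _time=mvappend(start, end)
| mvexpand _time
| eval status=if(_time=start, "fresh", "spoiled")
| eval _time=if(status="spoiled", tonumber(_time) + 1, tonumber(_time))
| eval dummy=1
| table _time dummy status
| outputlookup timed_freshness.csv

Then probe it with an id near the start, one in the middle, and one just past the end:

| makeresults
| eval id=split("551000000000100 553000000000000 559407584579260", " ")
| mvexpand id
| eval _time=tonumber(id), dummy=1
| lookup freshness_timed dummy OUTPUT status
| table id status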
Hmmm, it kind of works... The size of the id itself doesn't seem to be the problem (the last one is correctly identified as spoiled), but the lookup fails if the id is too far from the start of its range...
The problem is in the lookup definition. Did you notice the maximum offset field? Let's go back to the definition:
"The maximum time in seconds that the event time may be ahead of lookup entry time for a match to occur. Default is 2000000000." The maximum offset is 2 million by default, and we know our ranges are bigger than that. Adding a few zeros should do the trick.
Now it should be good. Let's add a fillnull as well:
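Rerunning the probe, with unmatched ids now defaulting to spoiled:

| makeresults
| eval id=split("551000000000100 553000000000000 559407584579260", " ")
| mvexpand id
| eval _time=tonumber(id), dummy=1
| lookup freshness_timed dummy OUTPUT status
| fillnull value="spoiled" status    ``` anything before the first range is spoiled too ```
| table id status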
Amazing! So that means scale is not an issue. Now the real problem is dealing with overlapping ranges.
De-overlapping ranges
As always, my goal with this series is to "show my thinking". Of course it's edited and smoothed out a bit, but I'm keeping some of the chaos so you can see that the road to a solution isn't straight. Also, there's definitely more than one way to do it, you just have to find one.
I guess it makes sense to start by sorting by start. Then let's use streamstats to bring in the previous value, and check if it overlaps. If it does, let's expand the current range.
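A sketch of that first pass (assuming numeric start and end fields, one range per event):

| sort 0 + start
| streamstats current=f window=1 last(end) AS prev_end    ``` peek at the previous range ```
| eval end=if(isnotnull(prev_end) AND prev_end>=start AND prev_end>end, prev_end, end)
| fields - prev_end

Each event only sees its immediate predecessor, so a single pass only merges one step of a chain of overlaps.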
Now I guess we could repeat that.
Ok, that's starting to do something. If we sorted in decreasing order of end, we could identify fully redundant ranges:
So if it's redundant, remove it, otherwise expand it:
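Something like this, again only looking at the immediate predecessor (which, sorted by decreasing end, is a range ending at or after the current one):

| sort 0 - end
| streamstats current=f window=1 last(start) AS prev_start, last(end) AS prev_end
| eval redundant=if(isnotnull(prev_start) AND prev_start<=start, "yes", "no")    ``` fully contained in the previous range ```
| eval end=if(redundant="no" AND isnotnull(prev_start) AND prev_start<=end, prev_end, end)    ``` it overlaps: absorb its end ```
| where redundant="no"
| fields - prev_start prev_end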
Repeat the whole thing.
That seems to work. Let's do more testing. We might need to repeat the code a few more times.
Note: I'm not a huge fan of this approach. We could make it nicer with mvrange and map, as we've done before, but we still have the problem that we can't predict how many iterations will be necessary. I just can't think of a way to do a "while" loop in Splunk.
What I do like is that we are leveraging Splunk's mechanisms (the timed lookup, but also just the streaming nature of SPL) and not putting undue load on the instance.
Going for it
Let's first save all the ingredient ranges:
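Reusing the section-splitting search from earlier (field and file names are my choices):

| streamstats count AS row
| eval blank_row=if(line="", row, null())
| eventstats min(blank_row) AS blank_row
| where row<blank_row AND line!=""    ``` keep only the range lines ```
| rex field=line "(?<start>\d+)-(?<end>\d+)"
| eval start=tonumber(start), end=tonumber(end)
| table start end
| outputlookup ingredient_ranges.csv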
Then let's modify our de-overlapping search so it can be repeated:
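Reading from and writing back to the same lookup is what makes it repeatable: every run merges and prunes a bit more. A sketch:

| inputlookup ingredient_ranges.csv
| eval start=tonumber(start), end=tonumber(end)
| sort 0 + start
| streamstats current=f window=1 last(end) AS prev_end
| eval end=if(isnotnull(prev_end) AND prev_end>=start AND prev_end>end, prev_end, end)
| fields - prev_end
| sort 0 - end
| streamstats current=f window=1 last(start) AS prev_start, last(end) AS prev_end
| eval redundant=if(isnotnull(prev_start) AND prev_start<=start, "yes", "no")
| eval end=if(redundant="no" AND isnotnull(prev_start) AND prev_start<=end, prev_end, end)
| where redundant="no"
| table start end
| outputlookup ingredient_ranges.csv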
It only needs to be run a few times. Then we shape the timed lookup:
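The same timifier logic as on the sample data, just fed from the lookup:

| inputlookup ingredient_ranges.csv
| eval _time=mvappend(start, end)
| mvexpand _time
| eval status=if(_time=start, "fresh", "spoiled")
| eval _time=if(status="spoiled", tonumber(_time) + 1, tonumber(_time))
| eval dummy=1
| sort 0 + _time
| table _time dummy status
| outputlookup timed_freshness.csv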
Unfortunately that does not give me the right answer for my challenge data...! :-(
Looking at our timed lookup, we can see we're not getting fresh and spoiled in a nice alternating pattern... If we colour start and end, and rename the _time field so we can see what it contains:
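Minus the colouring, something like:

| inputlookup timed_freshness.csv
| rename _time AS id
| sort 0 + id
| table id status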
The problem occurs when a range is just one ID, so start and end are the same (in brown here). That means we need to modify our timifier search because this logic is flawed:
I guess streamstats will do:
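The idea: instead of deriving the status by comparing _time to start (which, at least as sketched above, labels both boundaries fresh when start and end are equal), number the two boundaries of each range with streamstats and derive the status from the position:

| inputlookup ingredient_ranges.csv
| eval _time=mvappend(start, end)
| mvexpand _time
| streamstats count AS boundary BY start    ``` 1 = start boundary, 2 = end boundary ```
| eval status=if(boundary=1, "fresh", "spoiled")
| eval _time=if(status="spoiled", tonumber(_time) + 1, tonumber(_time))
| eval dummy=1
| sort 0 + _time
| table _time dummy status
| outputlookup timed_freshness.csv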
By the way, I'm just noticing at the time of writing that we didn't take care of adjacent ID ranges. Here the second ID range, made of just 8074420174647, is contiguous with the previous range (ending at 8074420174646). We could revisit our approach to merge them. However, we don't need to, because...
...now we get the correct answer for part 1!
Part 2
That has to be the easiest thing ever. Since we got rid of duplicated/overlapping ranges, all we need to do is:
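Sum the sizes of the de-overlapped ranges (assuming that's what part 2 boils down to):

| inputlookup ingredient_ranges.csv
| eval size=end - start + 1
| stats sum(size) AS answer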