Let's say we have a Foo class with a DateTime Timestamp property inside. We have a fancy method that parses Foo instances out of XML files in some directory:
public IEnumerable<Foo> GetAllTheFoos(DirectoryInfo dir)
{
    foreach (FileInfo fi in dir.EnumerateFiles("foo*.xml", SearchOption.TopDirectoryOnly))
    {
        using (FileStream fs = fi.OpenRead())
            yield return Foo.CreateFromXML(fs);
    }
}
To give you some perspective: the data in these files has been recorded for about 2 years, usually at a rate of several Foos every minute.
Now: we have a parameter called TimeSpan TrainingPeriod which is about 15 days for example. What I'd like to accomplish is to call:
var allTheData = GetAllTheFoos(myDirectory);
and obtain two IEnumerable<Foo> sequences from it, TrainingSet and TestSet, where TrainingSet consists of the Foos from the first 15 days of recording and TestSet of all the rest. In other words, my code should be semantically equivalent to:
TimeSpan TrainingPeriod = TimeSpan.FromDays(15); // 15 days; new TimeSpan(15, 0, 0) would mean 15 hours
var allTheData = GetAllTheFoos(myDirectory);
List<Foo> allTheDataList = allTheData.ToList();
var threshold = allTheDataList[0].Timestamp + TrainingPeriod;
List<Foo> TrainingSet = allTheDataList.Where(foo => foo.Timestamp < threshold).ToList();
List<Foo> TestSet = allTheDataList.Where(foo => foo.Timestamp >= threshold).ToList();
By the way, the XML file naming convention guarantees that the Foos are returned in chronological order. Of course, I do not want to hold everything in memory, which is what happens every time .ToList() is called. So I came up with another solution:
TimeSpan TrainingPeriod = TimeSpan.FromDays(15); // 15 days; new TimeSpan(15, 0, 0) would mean 15 hours
var allTheData = GetAllTheFoos(myDirectory);
var threshold = allTheData.First().Timestamp + TrainingPeriod; // a minor issue
var grouped = from foo in allTheData
              group foo by foo.Timestamp < threshold;
var TrainingSet = grouped.First(g => g.Key);
var TestSet = grouped.First(g => !g.Key); // the major one
However, there is a minor and a major issue with that piece of code. The minor one is that the first file is read at least twice, which doesn't really matter. The major one is that TrainingSet and TestSet appear to access the directory independently: each of them reads every file and keeps only the Foos that satisfy its timestamp constraint. I'm not too puzzled by that; in fact, if it had worked I would be puzzled and would have to rethink LINQ once again. But this raises file-access issues, and every file is parsed twice, which is a total waste of CPU time.
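To illustrate what I believe is going on, here is a minimal sketch of the double enumeration. Everything in it is made up for the demonstration: Source is a hypothetical stand-in for GetAllTheFoos that yields plain integers instead of parsing files, and ItemsProduced counts how many items the iterator has handed out in total. The grouping is the same shape as in my code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Total number of items the source iterator has produced so far.
    public static int ItemsProduced;

    // Hypothetical stand-in for GetAllTheFoos: lazily yields 1..count.
    public static IEnumerable<int> Source(int count)
    {
        for (int i = 1; i <= count; i++)
        {
            ItemsProduced++;        // counts one "file parsed"
            yield return i;
        }
    }

    // Runs the grouped query twice and reports how many items were produced.
    public static int CountProducedItems()
    {
        ItemsProduced = 0;

        // Nothing is read yet: GroupBy is deferred.
        var grouped = Source(5).GroupBy(n => n < 3);

        // Each First(...) enumerates 'grouped' from scratch, and GroupBy
        // consumes the whole source before it returns its first group.
        var low = grouped.First(g => g.Key).ToList();    // first full pass
        var high = grouped.First(g => !g.Key).ToList();  // second full pass

        return ItemsProduced;
    }

    public static void Main()
    {
        Console.WriteLine(CountProducedItems()); // prints 10, not 5
    }
}
```

If this sketch reflects the real situation, then besides reading everything twice, each pass also buffers all the elements inside GroupBy, so the grouped version would not save memory compared to .ToList() either.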
So my question is: can I achieve this effect using only simple LINQ/C# tools? I think I can do this in a good ol' brute-force way, overriding some GetEnumerator(), MoveNext() methods and so on - please don't bother typing it, I can totally handle this on my own.
However, if there is some elegant, short&sweet solution to this, it would be highly appreciated.
Thank you!
 