Monday, January 26, 2009

Parsing Text Files

In teaching my online courses at the University of Phoenix, I spend considerable time evaluating students' postings. I wanted to be able to run keyword searches, compute statistics, and so on, and record the results in a database for reporting purposes. I therefore needed a tool that takes text copied to the clipboard, parses it, and saves the result as XML for the database.

Lysle (2008) has an article on parsing sentences that I can use for these purposes at http://www.csharpcorner.com/UploadFile/scottlysle/ParseSentencesCS02112008055809AM/ParseSentencesCS.aspx. Figure 1 shows the user interface of the sample application, which is written in C#.



Figure 1. Parsing Interface

The article illustrates three methods:
  • Parse Reasonable: Split the text at typical sentence terminators and keep the terminators.
  • Parse Best: Split the text using a regular expression.
  • Parse Without Endings: Split the text and drop the sentence terminators.
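
To make the idea concrete, here is a minimal sketch of a regular-expression splitter that can either keep or drop the sentence terminators. It is not Lysle's code: the class name SentenceSplitter, the method Split, and the keepEndings flag are hypothetical names chosen for illustration, and the pattern is deliberately simple (it treats abbreviations such as "Dr." as sentence ends).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not taken from Lysle's article.
public static class SentenceSplitter
{
    // A sentence starts at a non-whitespace character and runs lazily to the
    // first run of . ! or ? that is followed by whitespace or the end of input.
    // A trailing fragment with no terminator is also returned.
    private static readonly Regex SentencePattern = new Regex(
        @"\S.*?(?:[.!?]+(?=\s|$)|$)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static List<string> Split(string text, bool keepEndings)
    {
        var sentences = new List<string>();
        if (string.IsNullOrEmpty(text))
            return sentences;

        foreach (Match m in SentencePattern.Matches(text))
        {
            string sentence = m.Value.Trim();
            if (!keepEndings)
                sentence = sentence.TrimEnd('.', '!', '?');
            if (sentence.Length > 0)
                sentences.Add(sentence);
        }
        return sentences;
    }
}
```

Calling SentenceSplitter.Split(text, true) keeps the terminators, in the spirit of Parse Reasonable, while passing false drops them, in the spirit of Parse Without Endings.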

Of course, I would also like to copy text to the clipboard and paste it back. Here are the Click handlers for the btnCopy and btnPaste buttons.

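//copy the text
private void btnCopy_Click(object sender, EventArgs e)
{
    // Clipboard.SetText throws ArgumentNullException for null or empty text
    if (!String.IsNullOrEmpty(txtBoxA.Text))
        Clipboard.SetText(txtBoxA.Text);
}

//paste the text
private void btnPaste_Click(object sender, EventArgs e)
{
    // GetText returns an empty string when the clipboard holds no text
    txtBoxB.Text = Clipboard.GetText();
}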


The text can also be converted to XML with the following code by Cochran at http://www.csharpcorner.com/UploadFile/rmcochran/FlatFileToXmlDocument06302007111353AM/FlatFileToXmlDocument.aspx, which parses tab-delimited or comma-delimited text into an XmlDocument. Here is the example:


using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;
using System.IO;

namespace FlatFileParser
{
    public static class Parser
    {
        #region Member Variables

        private const string
            strComma = ",",
            strTemporaryPlaceholder = "~~`~~",
            strTab = "\t";

        private static readonly Regex
            m_commaFixer = new Regex("\".*?,.*?\"", RegexOptions.Compiled),
            m_quotesOnBothEnds = new Regex("^\".*\"$", RegexOptions.Compiled);

        #endregion
        #region Methods

        public static XmlDocument ParseTabToXml(string input, string topElementName, string recordElementName,
            params string[] recordItemElementName)
        {
            XmlDocument doc = ParseToXml(input, new char[] { strTab[0] }, topElementName, recordElementName,
                recordItemElementName);

            PostProcess(doc, PostProcessTabNode);

            return doc;
        }

        public static XmlDocument ParseCsvToXml(string input, string topElementName, string recordElementName,
            params string[] recordItemElementName)
        {
            input = PreProcessCSV(input);
            XmlDocument doc = ParseToXml(input, new char[] { strComma[0] }, topElementName, recordElementName,
                recordItemElementName);

            PostProcess(doc, PostProcessCsvNode);

            return doc;
        }

        #endregion
        #region Utility Methods

        private static XmlDocument ParseToXml(string input, char[] seperator, string topElementName,
            string recordElementName, string[] recordItemElementName)
        {
            string[][] data = Dissasemble(input, seperator);

            return BuildDocument(data, topElementName, recordElementName, recordItemElementName);
        }

        // Strip the quotes from quoted fields that contain commas and hide
        // those commas behind a placeholder so they are not treated as separators.
        private static string PreProcessCSV(string input)
        {
            MatchCollection collection = m_commaFixer.Matches(input);
            foreach (Match m in collection)
                input = input.Replace(m.Value,
                    m.Value.Substring(1, m.Value.Length - 2).Replace(strComma, strTemporaryPlaceholder));
            return input;
        }

        // Apply the given action to a node and, recursively, to all of its descendants.
        private static void PostProcess(XmlNode node, Action<XmlNode> process)
        {
            process(node);
            foreach (XmlNode subNode in node.ChildNodes)
                PostProcess(subNode, process);
        }

        // Remove the surrounding quotes from a value that is quoted on both ends.
        private static void PostProcessTabNode(XmlNode node)
        {
            if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
                node.Value = node.Value.Substring(1, node.Value.Length - 2);
        }

        // Restore the commas that PreProcessCSV replaced with the placeholder.
        private static void PostProcessCsvNode(XmlNode node)
        {
            if (!String.IsNullOrEmpty(node.Value))
                node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);
        }

        // Dissasemble and BuildDocument, called by ParseToXml, are not included
        // in this excerpt; see Cochran's article for the full source.

        #endregion
    }
}

Using these building blocks, one can build an application to accomplish the ideas presented above.
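
As a rough sketch of how the last step might look, the code below counts a keyword in each parsed sentence and stores the results in an XmlDocument that can be saved for the database import. Everything here is an assumption made for illustration: the class PostingReport, its methods, and the posting, sentence, keyword, and hits names are hypothetical and do not come from Lysle's or Cochran's articles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical sketch; element and attribute names are illustrative only.
public static class PostingReport
{
    // Count whole-word, case-insensitive occurrences of keyword in text.
    // Assumes the keyword starts and ends with a letter or digit.
    public static int CountKeyword(string text, string keyword)
    {
        string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    // Build <posting keyword="..."><sentence hits="n">text</sentence>...</posting>.
    public static XmlDocument Build(IEnumerable<string> sentences, string keyword)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("posting");
        root.SetAttribute("keyword", keyword);
        doc.AppendChild(root);

        foreach (string sentence in sentences)
        {
            XmlElement element = doc.CreateElement("sentence");
            element.SetAttribute("hits", CountKeyword(sentence, keyword).ToString());
            element.InnerText = sentence;
            root.AppendChild(element);
        }
        return doc;
    }
}
```

For example, PostingReport.Build(sentences, "parse").Save("posting.xml") would write the report to a file, where sentences is whatever list the parsing step produced and posting.xml is a placeholder file name.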
