Clean a WordPress database in C#


This example shows how you can use file processing methods to clean a WordPress database. Unfortunately, my blog’s exported data contains a huge amount of spam. A single post had more than 500 pages of spam comments! Wading through the garbage to find the actual posts isn’t hard, but it’s time-consuming. This example lets me easily remove the spam comments and pings. It also lets me remove the EXTENDED BODY, EXCERPT, and KEYWORDS sections, which don’t contain any data due to the way the Quick Blogcast software exported its data.

The following code shows how the program removes the desired sections of data.

// Requires: using System.IO; using System.Collections.Generic;
// Remove the selected sections.
private void btnRemoveSections_Click(object sender, EventArgs e)
{
    lblResults.Text = "";
    Cursor = Cursors.WaitCursor;
    Refresh();

    // See what sections we want to remove.
    string targets = "\n";
    if (chkComment.Checked) targets += "COMMENT:\n";
    if (chkExtendedBody.Checked) targets += "EXTENDED BODY: \n";
    if (chkExcerpt.Checked) targets += "EXCERPT: \n";
    if (chkKeywords.Checked) targets += "KEYWORDS: \n";
    if (chkPing.Checked) targets += "PING:\n";

    // Read the input file.
    string[] input_lines = File.ReadAllLines(txtInput.Text);

    // Build the output.
    int num_removed = 0;
    bool reading_text = true;
    List<string> output_lines = new List<string>();
    foreach (string line in input_lines)
    {
        if (reading_text)
        {
            // We're reading text. See if we should stop.
            if (targets.Contains('\n' + line + '\n'))
            {
                num_removed++;
                reading_text = false;
            }
            else output_lines.Add(line);
        }
        else
        {
            // We're not reading text. See if we should start.
            if (line == "-----") reading_text = true;
        }
    }

    // Save the result.
    File.WriteAllLines(txtOutput.Text, output_lines.ToArray());
    lblResults.Text = "Removed " + num_removed.ToString() +
        " sections";
    Cursor = Cursors.Default;
}

This code uses the form’s CheckBoxes to build a string containing target tokens separated by newline characters. Each token is the line that starts a section to remove. For example, in the WordPress data, a comment begins with the line “COMMENT:” and ends with the line “-----”.
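To see why the targets string starts with a newline and every token ends with one, here is a minimal sketch of the membership test on its own. The names TargetDemo and IsTarget are hypothetical and are not part of the example program. The HashSet at the end is just an alternative you might prefer, not what the program above uses.

```csharp
using System;
using System.Collections.Generic;

public static class TargetDemo
{
    // Hypothetical helper: returns true only if line is exactly one of the
    // newline-delimited tokens in targets.
    public static bool IsTarget(string targets, string line)
    {
        return targets.Contains("\n" + line + "\n");
    }

    public static void Main()
    {
        // Same layout the program builds: a leading newline, then
        // each token followed by a newline.
        string targets = "\nCOMMENT:\nPING:\n";

        Console.WriteLine(IsTarget(targets, "COMMENT:")); // True
        Console.WriteLine(IsTarget(targets, "COMMENT"));  // False: no colon
        Console.WriteLine(IsTarget(targets, "MENT:"));    // False: only part of a token
        Console.WriteLine(IsTarget(targets, ""));         // False: blank lines never match

        // Alternative (an assumption, not the program's approach):
        // keep the tokens in a set and test for exact membership.
        var targetSet = new HashSet<string> { "COMMENT:", "PING:" };
        Console.WriteLine(targetSet.Contains("PING:"));   // True
    }
}
```

Wrapping the line in newlines before searching is what keeps a partial match such as “MENT:” from being treated as a section header.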

Next the program reads the input data file into an array of strings. It then loops through that array.

The variable reading_text keeps track of whether the program is reading something that we want to keep. Inside the loop, if reading_text is true, then we are reading something we want to keep. In that case, the program checks the current line of text to see if it is in the targets string containing the tokens for sections we want to remove. For example, the current line might be the beginning of a comment. If the line is in the targets string, the program increments num_removed so we remember that we removed a section. It also sets reading_text to false so the program doesn’t save any more lines until the section ends.

If the current line isn’t in the targets string, then the program simply adds it to the output list.

If reading_text is false inside the loop, the program should skip the current line. It checks the line to see if it is “-----”, which indicates the end of the section we are skipping. If it is, the program sets reading_text to true so it starts saving text again with the next line.

After it has processed all of the lines in the input file, the program uses File.WriteAllLines to write the saved lines into the output file.
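If you want to experiment with the loop without building the form, the following console sketch runs the same keep/skip logic on an in-memory array. The class and method names (SectionFilter, RemoveSections) are hypothetical, the sample lines are made up for illustration rather than taken from a real export, and the targets are held in a HashSet instead of the newline-delimited string used above.

```csharp
using System;
using System.Collections.Generic;

public static class SectionFilter
{
    // Hypothetical helper: copies lines to the result, skipping every section
    // that starts with a line in targets and ends with the "-----" line.
    public static List<string> RemoveSections(
        IEnumerable<string> lines, ISet<string> targets, out int numRemoved)
    {
        numRemoved = 0;
        bool readingText = true;
        var output = new List<string>();
        foreach (string line in lines)
        {
            if (readingText)
            {
                // We're keeping text. See if a section to remove starts here.
                if (targets.Contains(line))
                {
                    numRemoved++;
                    readingText = false;
                }
                else output.Add(line);
            }
            else if (line == "-----")
            {
                // End of the skipped section. Resume with the next line.
                readingText = true;
            }
        }
        return output;
    }

    public static void Main()
    {
        // Made-up sample data, not real exported blog content.
        string[] input =
        {
            "TITLE: Sample post",
            "-----",
            "BODY:",
            "Post text",
            "-----",
            "COMMENT:",
            "Spam spam spam",
            "-----",
            "PING:",
            "More spam",
            "-----",
        };
        var targets = new HashSet<string> { "COMMENT:", "PING:" };

        int removed;
        List<string> kept = RemoveSections(input, targets, out removed);
        foreach (string line in kept) Console.WriteLine(line);
        Console.WriteLine("Removed " + removed + " sections");
    }
}
```

With this sample, the first five lines are kept and two sections are removed. Note that the “-----” line that closes a removed section is dropped along with it, just as in the program above.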

The example ZIP file includes a small data file that contains one post with a few sections to remove. In my data, after I had already processed several hundred posts, the example removed 3,594 unnecessary sections and reduced the file’s size from 3.97 MB to 3.42 MB.


About RodStephens

Rod Stephens is a software consultant and author who has written more than 30 books and 250 magazine articles covering C#, Visual Basic, Visual Basic for Applications, Delphi, and Java.