Removing HTML Tags from HTML Source Code

How do search engines see your web pages? To look at a page's text without its markup, you can use the StripHTML function, which removes all HTML tags from a given string and leaves only the text you want to extract.

For example, an HTML fragment whose only text is Safir Medya, whatever tags surround it, is converted as follows:

Safir Medya

The StripHTML function we will use for this:

function StripHTML(S: string): string;
var
  TagBegin, TagEnd, TagLength: Integer;
begin
  TagBegin := Pos('<', S);  // position of the first <

  while TagBegin > 0 do begin  // while there is a < in S
    TagEnd := PosEx('>', S, TagBegin);  // first > after that < (PosEx needs StrUtils in the uses clause)
    if TagEnd = 0 then
      Break;  // unterminated tag: stop instead of looping forever
    TagLength := TagEnd - TagBegin + 1;
    Delete(S, TagBegin, TagLength);  // delete the tag
    TagBegin := Pos('<', S);  // search for the next <
  end;

  Result := S;  // return the stripped text
end;

To use this function in a Delphi form:

procedure TForm1.Button1Click(Sender: TObject);
begin
  Memo2.Text := StripHTML(Memo1.Text);
end;


  • Guy Gordon

    This code assumes correct HTML. Real-world websites often contain errors. Think of the code as a state machine with two states: InsideTag = True or False. While inside a tag you might find another '<', and while not in a tag you might find a '>', e.g. <br class="Apple-interchange-newline"> id="aswift_3" ...> (an actual example from eBay). In both of these cases the state machine may be out of step with the input stream. To get back in step, the code needs to determine the correct state from the surrounding text. This is non-trivial.

    2 years ago
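
The two-state view described in the comment above can be sketched as a single pass over the string. This is only a minimal illustration, not the article's method: the name StripHTMLStateMachine is hypothetical, and the handling of stray characters (a '<' inside a tag is swallowed, a '>' outside a tag is kept as text) is an assumed policy. It does not attempt the harder resynchronization the comment calls non-trivial.

```pascal
program StripTagsDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper, not part of the original article.
// Walks the string once, tracking whether we are inside a tag.
function StripHTMLStateMachine(const S: string): string;
var
  I: Integer;
  InsideTag: Boolean;
begin
  Result := '';
  InsideTag := False;
  for I := 1 to Length(S) do
  begin
    if InsideTag then
    begin
      // Assumed policy: everything up to the next '>' belongs to the tag,
      // including a stray '<'.
      if S[I] = '>' then
        InsideTag := False;
    end
    else if S[I] = '<' then
      InsideTag := True
    else
      // Assumed policy: a '>' found outside a tag is ordinary text.
      Result := Result + S[I];
  end;
end;

begin
  Writeln(StripHTMLStateMachine('<b>Safir Medya</b>'));  // prints: Safir Medya
  Writeln(StripHTMLStateMachine('1 > 0 <i>ok</i>'));     // prints: 1 > 0 ok
end.
```

Because the loop visits each character exactly once, a missing '>' cannot make it run forever; the rest of the input is simply treated as part of the unclosed tag. On the eBay fragment quoted in the comment, this sketch would still emit the stray attribute text after the first '>', which is the out-of-step behavior the comment warns about.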