How can I parse an HTML string in Java?

Given the string "<table><tr><td>Hello World!</td></tr></table>", what is the (simplest) way to get a DOM Element representing it?

Answer 1

I found this somewhere (I don't remember where):

 public static DocumentFragment parseXml(Document doc, String fragment)
 {
    // Wrap the fragment in an arbitrary element.
    fragment = "<fragment>"+fragment+"</fragment>";
    try
    {
        // Create a DOM builder and parse the fragment.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document d = factory.newDocumentBuilder().parse(
                new InputSource(new StringReader(fragment)));

        // Import the nodes of the new document into doc so that they
        // will be compatible with doc.
        Node node = doc.importNode(d.getDocumentElement(), true);

        // Create the document fragment node to hold the new nodes.
        DocumentFragment docfrag = doc.createDocumentFragment();

        // Move the nodes into the fragment.
        while (node.hasChildNodes())
        {
            docfrag.appendChild(node.removeChild(node.getFirstChild()));
        }
        // Return the fragment.
        return docfrag;
    }
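    catch (SAXException e)
    {
        // A parsing error occurred; the XML input is not valid.
    }
    catch (ParserConfigurationException e)
    {
        // The parser could not be configured.
    }
    catch (IOException e)
    {
        // Reading the input failed.
    }
    // Reached only when one of the exceptions above was swallowed.
    return null;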
}
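
Note that the method above relies on an XML parser, so it only accepts markup that is also well-formed XML, as the string in the question happens to be. For that case, a minimal sketch of getting the root Element directly might look like this. The class name DomParseDemo and the helper parseToElement are made up for illustration; only the standard JAXP and DOM classes are real.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomParseDemo {

    // Hypothetical helper: parses markup that is already well-formed XML
    // and returns its root element. Real-world HTML with unclosed tags
    // will make the XML parser throw.
    static Element parseToElement(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            // Report the failure instead of silently returning null.
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Element table = parseToElement("<table><tr><td>Hello World!</td></tr></table>");
        System.out.println(table.getTagName());      // prints "table"
        System.out.println(table.getTextContent());  // prints "Hello World!"
    }
}
```

Throwing an unchecked exception here is just one choice; it avoids the null return that the snippet above produces when parsing fails.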

Answer 2

Here is one way:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new HTMLTableParser(), true);
       reader.close();
   }
}

class HTMLTableParser extends HTMLEditorKit.ParserCallback {

    private boolean encounteredATableRow = false;

    public void handleText(char[] data, int pos) {
        if(encounteredATableRow) System.out.println(new String(data));
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if(t == HTML.Tag.TR) encounteredATableRow = true;
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if(t == HTML.Tag.TR) encounteredATableRow = false;
    }
}

Answer 3

You can use HTML Parser, a Java library for parsing HTML in either a linear or a nested fashion. It is open source and is hosted on SourceForge.

Answer 4

If you have a string containing HTML, you can use the Jsoup library to get the HTML elements:

String htmlTable= "<table><tr><td>Hello World!</td></tr></table>";
Document doc = Jsoup.parse(htmlTable);

// then use something like this to get your element:
Elements tds = doc.getElementsByTag("td");

// tds will contain this one element: <td>Hello World!</td>

Good luck!

Answer 5

You can use Swing:

"How do you use the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how."
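
The quoted article is not reproduced here, so the following is only a rough sketch of what using the Swing parser can look like, not Jeff Heaton's code. It collects every text run that the parser reports. The class name SwingTextExtractor and the helper extractText are hypothetical; ParserDelegator and HTMLEditorKit.ParserCallback are the standard Swing classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTextExtractor {

    // Hypothetical helper: returns the text runs found in the given HTML,
    // in the order the Swing parser reports them.
    static List<String> extractText(String html) {
        List<String> texts = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                texts.add(new String(data));
            }
        };
        try (StringReader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new IllegalStateException("Could not read HTML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Hello</td><td>World!</td></tr></table>";
        for (String text : extractText(html)) {
            System.out.println(text);
        }
    }
}
```

This approach is callback-based, so it gives you a stream of tags and text rather than a DOM Element; it is tolerant of sloppy HTML, though.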

Answer 6

I have used the Jericho HTML Parser. It is open source, tolerates badly formed tags, and is lightweight.