For argument's sake, let's assume an HTML parser.
I've read that it *tokenizes* everything first, and then parses it.
What does tokenize mean?
Does the parser read each character in turn, building up a multidimensional array to store the structure?
For example, does it read a `<` and then begin to capture the element, and then once it meets a closing `>` (outside of an attribute) push it onto an array or stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like [HTML Purifier][1], would that give me a good idea of how HTML is parsed?
[1]: http://htmlpurifier.org/

Look at en.wikipedia.org/wiki/Lexical_parser for a very brief intro; also check out the Parsing article there. And HTML Purifier, at some point, does exactly that.
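
To make the two phases concrete, here is a minimal sketch in Python. It is not how HTML Purifier or any browser does it; it assumes a tiny HTML subset (tags without attributes, plus text), and the names `TOKEN_RE`, `tokenize`, and `parse` are made up for illustration. The first function turns the character stream into a flat list of tokens, and the second walks those tokens and uses a stack of open elements to build a nested tree.

```python
import re

# Hypothetical pattern for a toy subset: <tag>, </tag>, or a run of text.
# Attributes, comments, and entities are deliberately not handled.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\s*>|([^<]+)")


def tokenize(html):
    """Turn a string into a flat list of (kind, value) tokens."""
    tokens = []
    # finditer silently skips anything the pattern cannot match
    # (for example a stray "<"); a real tokenizer would report or recover.
    for match in TOKEN_RE.finditer(html):
        closing, name, text = match.groups()
        if name is not None:
            kind = "close" if closing else "open"
            tokens.append((kind, name.lower()))
        elif text.strip():
            tokens.append(("text", text.strip()))
    return tokens


def parse(tokens):
    """Build a nested tree from the tokens using a stack of open elements."""
    root = {"tag": "#root", "children": []}
    stack = [root]  # the top of the stack is the element currently open
    for kind, value in tokens:
        if kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            # Only pop when the close tag matches the innermost open element;
            # mismatched close tags are ignored in this sketch.
            if len(stack) > 1 and stack[-1]["tag"] == value:
                stack.pop()
        else:
            stack[-1]["children"].append(value)
    return root


tokens = tokenize("<p>Hello <b>world</b></p>")
tree = parse(tokens)
print(tokens)
print(tree)
```

So the tokenizer answers "what pieces is this text made of?" and the parser answers "how do those pieces nest?". Real HTML parsers follow the same split, but the tokenizer is a much larger state machine (it has to cope with attributes, quoting, comments, and so on) and the tree builder has many rules for recovering from broken markup.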