How do I modify the html page with python?

original page html code:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   abcd
   <span class="ws0">adc</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   ab
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   SUP
   <span class="_ _93"> </span>
   OUT
   <span class="_ _a1"> </span>
   OUT
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   (V
   <span class="_ _54"> </span>
   V
   <span class="_ _a0">b<span class="_ _92">aa</span></span>
   V
  </div>
 </body>
</html>

to use the python program, modify the html page to look like this:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   4
   <span class="ws0">3</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   2
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   3
   <span class="_ _93">1</span>
   3
   <span class="_ _a1">1</span>
   3
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   2
   <span class="_ _54">1</span>
   2
   <span class="_ _a0">1<span class="_ _92">2</span></span>
   1
  </div>
 </body>
</html>

comparing the two page codes, you can see that wants to replace each text within each tag with the number of digits of the text, while ensuring that the original dom structure and tag attributes do not change , and finally save the results as a new page.

I can"t figure it out with beautifulsoup. Is this requirement too weird? Ask the gods for help. The above page is just an example, the real page dom structure is more nested, hard-coding is meaningless. )


import re

def f(m):
    s = m.group(1)
    length = len(s.strip())
    if length == 0:
        return '>{}<'.format(s)
    return '>{}<'.format(re.sub('\S+.?\S?', str(length), s))

p = re.compile('>(.*?)<', re.S)
print(p.sub(f, html))

import re
with open('1.html', 'r') as r:
    txt = ''.join(r.readlines())

print(txt)  -sharp html

def replace(match):
    t, s = match.group(1), match.group(1).strip()
    return '>%s<' % (t.replace(s, str(len(s))) if s else t)


txt1 = re.sub(r'>([.\S\s]*?)<', replace, txt)

print(txt1)  -sharp html

find a html parser. The transformed structure finds the text node and is replaced by the length of the text


recursively parsed and reconstructed with lxml http://lxml.de/


.

javascript is recommended. It can't be simpler.

browser < kbd > F12 < / kbd >, paste the following code into console.

function walk(node, fn) {
    if (node) do {
            fn(node);
            walk(node.firstChild, fn);       
    } while (node = node.nextSibling);
}
 
walk(document.body, function(node) {
        if(node.nodeType==1 || node.nodeType==3){       
                console.log(node.nodeValue);   
                node.nodeValue = (node.nodeValue+"").length;       
    }
});

there are pictures and truths:

Menu