Could an expert help me write a regular expression? Thank you very much!

Problem description

Regular expressions have always been a blind spot for me. I hope someone can help me write a regular expression that extracts the title, image link, article link and description from the web page fragment below. Thanks in advance!

The web page content that needs to be matched

<article class="excerpt excerpt-1">
            <a href="/szb/eth/28157.html" class="focus" target="_blank"><img alt=""" " class="thumb lazy" data-original="/uploads/allimg/180906/8-1PZ6094Za45-lp.png"/></a>
            <header>
                <h2><a href="/szb/eth/28157.html" title="<b>"" </b>" target="_blank"><b>"" </b></a></h2>
            </header>
            <p class="meta">
                <time><i class="fa fa-clock-o"></i><font color="#e15c34">2018-09-06</font></time>
                <span class="pv"><i class="fa fa-eye"></i>(1986)</span>
                <span class="pc"><i class="fa fa-comments-o"></i>(<span id="url::http://www.bitcoin86.com/szb/eth/28157.html" class = "cy_cmt_count" ></span>)</span>
            

<p class="note">(CBOE) ETH Business Insider CBOE2018 2017...

</article>

What result do you expect? What error message do you actually see?

I need to extract the href of the <a> tag inside <header> as the article link, and the text content of that <a> tag as the title. The data-original attribute of the <img> tag serves as the image link. The text in <p class="note"> serves as the description.

Because I'm not familiar with regular expressions, I don't know whether all four of the values above can be captured with a single expression and put into an array at indexes 0, 1, 2, 3.

If that idea is not realistic, I hope someone who knows regular expressions can help write four separate ones. Thanks again.


I have solved the problem myself, but if you have a good solution, you are welcome to post it to help other people who need it.


In Python you can do XPath matching directly with the etree module in lxml.


Regular expressions written in PHP:

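// $data is assumed to hold the HTML of the page; [1][0] and [2][0] pick the first article only.

// Article link (group 1) and title text (group 2) from the <h2> inside <header>
preg_match_all('/<h2><a href="(.*?)" .*><b>(.*?)<\/b>.*<\/h2>/', $data, $titleMatches);
$href = $titleMatches[1][0];
$title = $titleMatches[2][0];
echo $title.'<br>'; // title
echo $href.'<br>';  // article link

// Image link from the data-original attribute of the lazy-loaded <img>
preg_match_all('/<img.* class="thumb lazy" data-original="(.*?)"\/>/', $data, $imgMatches);
$img = $imgMatches[1][0];
echo $img.'<br>'; // image link

// Description; the s modifier lets . match newlines inside <p class="note">
preg_match_all('/<p class="note">(.*?)<\/p>/s', $data, $noteMatches);
$content = $noteMatches[1][0];
echo $content.'<br>'; // description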


Regular expressions in JS should look much the same as these, so please use them as a reference.
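On the original question of whether one expression can return all four values at indexes 0 to 3: it looks possible as long as the tags always appear in the same order. Below is a rough sketch in Python using only the standard `re` module. The sample HTML is a simplified copy of the fragment from the question, the title and description text are placeholders (the real text did not survive in the post), and the names `ARTICLE_RE` and `extract_articles` are made up for this example.

```python
import re

# Simplified copy of the fragment from the question.
# "Example title" and "Example description" are placeholder text.
HTML = '''<article class="excerpt excerpt-1">
    <a href="/szb/eth/28157.html" class="focus" target="_blank"><img alt="Example title" class="thumb lazy" data-original="/uploads/allimg/180906/8-1PZ6094Za45-lp.png"/></a>
    <header>
        <h2><a href="/szb/eth/28157.html" title="<b>Example title</b>" target="_blank"><b>Example title</b></a></h2>
    </header>
    <p class="meta"><time>2018-09-06</time></p>
    <p class="note">Example description</p>
</article>'''

# One pattern with four capture groups, in the order the tags appear.
# re.DOTALL lets "." cross line breaks between the tags.
ARTICLE_RE = re.compile(
    r'<img\b[^>]*\bdata-original="([^"]*)"'   # group 1: image link
    r'.*?<h2>\s*<a\b[^>]*?\bhref="([^"]*)"'   # group 2: article link
    r'.*?>\s*<b>(.*?)</b>'                    # group 3: title text
    r'.*?<p class="note">\s*(.*?)\s*</p>',    # group 4: description
    re.DOTALL,
)


def extract_articles(html):
    """Return one [title, image link, article link, description] list per match."""
    return [
        [title.strip(), img, href, note]
        for img, href, title, note in ARTICLE_RE.findall(html)
    ]


for record in extract_articles(HTML):
    print(record)
```

Each record is a plain list, so index 0 is the title, 1 the image link, 2 the article link and 3 the description. The pattern depends on the tag order and on every article having a closed `<p class="note">`, so it would probably need adjusting if the page layout changes.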
  1. To extract things from web pages, don't use regular expressions: the patterns get very complex and generalize poorly. Most languages have off-the-shelf packages, so just install one, build the document tree and traverse it.
  2. For learning regular expressions, I recommend the "30-Minute Introduction to Regular Expressions" tutorial.
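
To show what the parser approach in point 1 could look like, here is a minimal sketch in Python that uses only the standard library's `html.parser` instead of an installed package. The class name `ArticleParser`, the helper `parse_articles` and the dictionary keys are assumptions for this example, and the title and description text in the sample are placeholders. The sample leaves the note paragraph unclosed, as in the fragment posted in the question.

```python
from html.parser import HTMLParser

# Simplified copy of the fragment from the question; the text is placeholder
# text, and <p class="note"> is left unclosed as in the posted fragment.
HTML = '''<article class="excerpt excerpt-1">
    <a href="/szb/eth/28157.html" class="focus" target="_blank"><img alt="Example title" class="thumb lazy" data-original="/uploads/allimg/180906/8-1PZ6094Za45-lp.png"/></a>
    <header>
        <h2><a href="/szb/eth/28157.html" title="<b>Example title</b>" target="_blank"><b>Example title</b></a></h2>
    </header>
    <p class="meta"><time>2018-09-06</time></p>
    <p class="note">Example description
</article>'''


class ArticleParser(HTMLParser):
    """Collect title, image link, article link and description per <article>."""

    def __init__(self):
        super().__init__()
        self.articles = []    # finished records
        self._current = None  # record for the <article> being read
        self._in_h2 = False   # True while inside <h2>
        self._field = None    # text field currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._current = {"title": "", "image": None, "link": None, "note": ""}
        elif self._current is None:
            return
        elif tag == "img" and "data-original" in attrs:
            self._current["image"] = attrs["data-original"]
        elif tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._current["link"] = attrs.get("href")
            self._field = "title"
        elif tag == "p" and attrs.get("class") == "note":
            self._field = "note"

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif tag in ("a", "p"):
            self._field = None
        elif tag == "article" and self._current is not None:
            # Closing the article also ends any text capture still open.
            self._current["title"] = self._current["title"].strip()
            self._current["note"] = self._current["note"].strip()
            self.articles.append(self._current)
            self._current = None
            self._field = None

    def handle_data(self, data):
        if self._current is not None and self._field is not None:
            self._current[self._field] += data


def parse_articles(html):
    """Feed the HTML to the parser and return the collected records."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.articles


for article in parse_articles(HTML):
    print(article)
```

Compared with a regular expression, this does not care about attribute order or line breaks, and it still returns the description when the closing `</p>` is missing.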