Truncating text without splitting words or breaking HTML tags

I'm trying to cut text off after 236 characters without splitting words in half and while preserving HTML tags. This is what I'm using right now:

$shortdesc = $_helper->productAttribute($_product, $_product->getShortDescription(), 'short_description');
$length = 236;
echo substr($shortdesc, 0, strrpos(substr($shortdesc, 0, $length), " "));

While this works in most cases, it does not respect HTML tags. So, for example, this text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>

will be cut off while the tag is still open. Is there a way to cut the text after 236 characters but respect HTML tags?

Answer 1

The best solution I've come across is from the CakePHP framework's TextHelper class.

Here is the method:

/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string  $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
    $default = array(
        'ending' => '...', 'exact' => true, 'html' => false
    );
    $options = array_merge($default, $options);
    extract($options);

    if ($html) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen(strip_tags($ending));
        $openTags = array();
        $truncate = '';

        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
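            // Anchored so that only void elements match (the unanchored pattern also matched e.g. "abbr").
            if (!preg_match('/^(?:img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)$/s', $tag[2])) {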
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }
    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        // mb_strrpos() returns false when there is no space; isset() would still be true for false.
        if ($spacepos !== false) {
            if ($html) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }
    $truncate .= $ending;

    if ($html) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

Other frameworks may have similar (or different) solutions to this problem, so you may want to look at those too. My familiarity with Cake is what prompted me to link to their solution.

Edit:

I just tested this method in an application I'm working on, using the OP's text:

<?php 
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>', 
236, 
array('html' => true, 'ending' => '')); 
?>

Output:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubegre</strong>

Note that the output stops just short of completing the last word but still includes the complete strong tags.

Answer 2

This should do it:

class Html{

  protected
    $reachedLimit = false,
    $totalLen     = 0,
    $maxLen       = 25,
    $toRemove     = array();

  public static function trim($html, $maxLen = 25){

    $dom = new DomDocument();
    $dom->loadHTML($html);

    $html = new static();
    $toRemove = $html->walk($dom, $maxLen);

    // remove any nodes that passed our limit
    foreach($toRemove as $child) 
      $child->parentNode->removeChild($child);

    // remove wrapper tags added by DD (doctype, html...)
    if(version_compare(PHP_VERSION, '5.3.6') < 0){
      // http://stackoverflow.com/a/6953808/1058140
      $dom->removeChild($dom->firstChild);            
      $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);
      return $dom->saveHTML();
    }

    return $dom->saveHTML($dom->getElementsByTagName('body')->item(0));   
  }

  protected function walk(DomNode $node, $maxLen){

    if($this->reachedLimit){
      $this->toRemove[] = $node;

    }else{
      // only text nodes should have text,
      // so do the splitting here
      if($node instanceof DomText){
        $this->totalLen += $nodeLen = strlen($node->nodeValue);

        // use mb_strlen / mb_substr for UTF-8 support
        if($this->totalLen > $maxLen){
          $node->nodeValue = substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
          $this->reachedLimit = true;
        }

      }

      // if node has children, walk its child elements 
      if(isset($node->childNodes))
        foreach($node->childNodes as $child)
          $this->walk($child, $maxLen);
    }  

    return $this->toRemove;
  }  
}

Use it like this: $str = Html::trim($str, 236);


Some performance comparisons between this and CakePHP's regex solution:

There is very little difference, and with very large strings DomDocument is actually faster. In my opinion, reliability matters more than saving a few microseconds.
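The comment inside walk() recommends mb_strlen / mb_substr for UTF-8 input. A minimal sketch of why that matters, assuming the mbstring extension is loaded (the sample word is only an illustration):

```php
<?php
// "Größe" written with explicit UTF-8 bytes: 5 characters, 7 bytes.
$text = "Gr\xC3\xB6\xC3\x9Fe";

$byteLength = strlen($text);              // counts bytes
$charLength = mb_strlen($text, 'UTF-8');  // counts characters

// substr() works on bytes and can cut a multi-byte character in half;
// mb_substr() works on whole characters.
$byteCut = substr($text, 0, 3);
$charCut = mb_substr($text, 0, 3, 'UTF-8');

echo $byteLength, ' bytes, ', $charLength, " characters\n";
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte cut still valid UTF-8?
var_dump(mb_check_encoding($charCut, 'UTF-8')); // is the character cut still valid UTF-8?
```

With a byte-based limit, a truncated description can end in a broken character, so swapping the string functions in walk() is worth considering for non-ASCII text.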

Answer 3

Can I just share a thought?

Sample text:

Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong> hello

First parse it into:

array(
    '0' => array(
        'tag' => '',
        'text' => 'Lorem ipsum dolor sit amet, '
    ),
    '1' => array(
        'tag' => '<i class="red">',
        'text' => 'magna aliquyam erat',
    ),
    '2' => ......
    '3' => ......
)

then cut the text pieces one by one, wrap each one in its tag again after cutting,

then join them.
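One way this idea could be sketched in PHP is shown below. The function names parseSegments and cutSegments are hypothetical, and the sketch assumes flat markup (elements that contain only text, no nesting) and the mbstring extension. It cuts at an exact character count and does not look for word boundaries.

```php
<?php
// Hypothetical helper: split flat HTML into tag/text segments.
function parseSegments($html) {
    // Either <name ...>text</name> (no nested tags) or a run of plain text.
    preg_match_all('/<(\w+)([^>]*)>([^<]*)<\/\1>|([^<]+)/', $html, $matches, PREG_SET_ORDER);
    $segments = array();
    foreach ($matches as $m) {
        if (isset($m[4])) {
            // Plain text outside any element.
            $segments[] = array('tag' => '', 'name' => '', 'text' => $m[4]);
        } else {
            $segments[] = array('tag' => '<' . $m[1] . $m[2] . '>', 'name' => $m[1], 'text' => $m[3]);
        }
    }
    return $segments;
}

// Hypothetical helper: cut the text of each segment, re-wrap it, and join.
function cutSegments(array $segments, $limit) {
    $out = '';
    foreach ($segments as $seg) {
        if ($limit <= 0) {
            break; // character budget used up
        }
        $text = mb_substr($seg['text'], 0, $limit, 'UTF-8');
        $limit -= mb_strlen($text, 'UTF-8');
        $out .= ($seg['tag'] === '')
            ? $text
            : $seg['tag'] . $text . '</' . $seg['name'] . '>';
    }
    return $out;
}

$segments = parseSegments('Lorem <i class="red">magna erat</i>, duo');
echo cutSegments($segments, 11), "\n";
```

Because every element is re-wrapped after its text is cut, the closing tag is never lost, which is the point of the approach. Nested markup would need a real parser or a stack of open tags.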

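Answer 4

// $skip_html now has a default: a required parameter after optional ones is deprecated in PHP 8.
function limitStrlen($input, $length, $ellipses = true, $strip_html = true, $skip_html = false) 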
{
    // strip tags, if desired
    if ($strip_html || !$skip_html) 
    {
        $input = strip_tags($input);

        // no need to trim, already shorter than trim length
        if (strlen($input) <= $length) 
        {
            return $input;
        }

        //find last space within length
        $last_space = strrpos(substr($input, 0, $length), ' ');
        if($last_space !== false) 
        {
            $trimmed_text = substr($input, 0, $last_space);
        } 
        else 
        {
            $trimmed_text = substr($input, 0, $length);
        }
    } 
    else 
    {
        if (strlen(strip_tags($input)) <= $length) 
        {
            return $input;
        }

        $trimmed_text = $input;

        $last_space = $length + 1;

        while(true)
        {
            $last_space = strrpos($trimmed_text, ' ');

            if($last_space !== false) 
            {
                $trimmed_text = substr($trimmed_text, 0, $last_space);

                if (strlen(strip_tags($trimmed_text)) <= $length) 
                {
                    break;
                }
            } 
            else 
            {
                $trimmed_text = substr($trimmed_text, 0, $length);
                break;
            }
        }

        // close unclosed tags.
        $doc = new DOMDocument();
        $doc->loadHTML($trimmed_text);
        $trimmed_text = $doc->saveHTML();
    }

    // add ellipses (...)
    if ($ellipses) 
    {
        $trimmed_text .= '...';
    }

    return $trimmed_text;
}

$str = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dolor</p> sit amet, consetetur.</h1>";

// view the HTML
echo htmlentities(limitStrlen($str, 22, false, false, true), ENT_COMPAT, 'UTF-8');

// view the result
echo limitStrlen($str, 22, false, false, true);

Note: there may be a better way to close the tags than using DOMDocument. For example, we can put a p tag inside an h1 tag and it will still work in browsers. But in this case DOMDocument will close the heading tag before the p tag, because in theory a p tag is not allowed inside a heading. So be careful with strict HTML standards.
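As one possible alternative to DOMDocument for the closing step, open tags can be tracked on a stack and closed in reverse order, which leaves the nesting exactly as written. This is only a sketch: closeOpenTags is a hypothetical name, and the regex assumes reasonably clean markup with no ">" inside attribute values.

```php
<?php
// Hypothetical helper: append closing tags for elements left open in $html.
function closeOpenTags($html) {
    // Void elements never get a closing tag.
    $void = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr');
    $open = array();

    preg_match_all('/<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true) || $tag[3] === '/') {
            continue; // void or self-closing: nothing to track
        }
        if ($tag[1] === '/') {
            // Closing tag: remove the most recently opened element of that name.
            $keys = array_keys($open, $name, true);
            if ($keys) {
                array_splice($open, end($keys), 1);
            }
        } else {
            $open[] = $name;
        }
    }

    // Close whatever is still open, innermost first.
    foreach (array_reverse($open) as $name) {
        $html .= '</' . $name . '>';
    }
    return $html;
}

$cut = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dol";
echo closeOpenTags($cut), "\n";
```

Unlike the DOMDocument round trip, this does not add doctype, html, or body wrappers and does not rearrange the p inside the h1. It also does nothing to repair invalid markup.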

Answer 5

I did this in JS; hopefully the logic will help in PHP too.

splitText : function(content, count){
        var originalContent = content;
         content = content.substring(0, count);
          // If the cut lands inside an HTML tag (a "<" without its closing ">" before the breaking point).
         if (content.lastIndexOf("<") > content.lastIndexOf(">")){
            content = content.substring(0, content.lastIndexOf('<'));
            count = content.length;
            if(originalContent.indexOf("</", count)!=-1){
                content += originalContent.substring(count, originalContent.indexOf('>', originalContent.indexOf("</", count))+1);
            }else{
                 content += originalContent.substring(count, originalContent.indexOf('>', count)+1);
            }
          //If the breaking point is in between tags.
         }else if(content.lastIndexOf("<") != content.lastIndexOf("</")){
            content = originalContent.substring(0, originalContent.indexOf('>', count)+1);
         }
        return content;
    },

Hope this logic helps someone...

Answer 6

You can use an XML approach and push elements onto a string variable until the string length exceeds 236.

Example pseudocode:

for each node // text or tag
  push to the string var

  if string length > 236
    break

endfor

For parsing HTML in PHP: http://simplehtmldom.sourceforge.net/

Answer 7

Here is a JS solution: trim-html

The idea is to split the HTML string so that you get an array whose elements are either an HTML tag (opening or closing) or just a plain string.

var arr = html.replace(/</g, "\n<")
              .replace(/>/g, ">\n")
              .replace(/\n\n/g, "\n")
              .replace(/^\n/g, "")
              .replace(/\n$/g, "")
              .split("\n");

Then we can iterate over the array and count characters.
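Since the question is about PHP, here is a rough PHP sketch of the same token idea. It is not the trim-html library's code; trimHtmlTokens is a hypothetical name, and the sketch assumes well-nested markup and the mbstring extension. Only text tokens count toward the limit, the cut is moved back to the last space in the final piece, and tags that are still open get closed at the end.

```php
<?php
// Hypothetical function: truncate HTML by iterating over tag/text tokens.
function trimHtmlTokens($html, $limit) {
    $void = array('br', 'hr', 'img', 'input', 'link', 'meta');
    // Each token is either a tag ("<...>") or a run of plain text.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $out = '';
    $count = 0;       // visible characters emitted so far
    $open = array();  // names of open elements, innermost last

    foreach ($tokens as $token) {
        if (preg_match('/^<(\/?)(\w+)[^>]*>$/', $token, $m)) {
            $name = strtolower($m[2]);
            if ($m[1] === '/') {
                array_pop($open); // closing tag (assumes well-nested input)
            } elseif (substr($token, -2) !== '/>' && !in_array($name, $void, true)) {
                $open[] = $name;  // opening tag
            }
            $out .= $token;       // tags do not count toward the limit
            continue;
        }
        $len = mb_strlen($token, 'UTF-8');
        if ($count + $len > $limit) {
            $piece = mb_substr($token, 0, $limit - $count, 'UTF-8');
            $space = mb_strrpos($piece, ' ', 0, 'UTF-8');
            if ($space !== false) {
                $piece = mb_substr($piece, 0, $space, 'UTF-8'); // do not split a word
            }
            $out .= $piece;
            break;
        }
        $out .= $token;
        $count += $len;
    }

    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo trimHtmlTokens('Lorem ipsum <strong>Stet clita kasd</strong> end', 20), "\n";
```

If the final piece contains no space at all, this sketch still cuts mid-word. A fuller version would look back into earlier text tokens for a boundary.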