How does Guava's Splitter.onPattern(..).split() differ from String.split(..)?

I recently used a plain regular expression with a lookahead to split a string:

"abc8".split("(?=\\d)|\\W")

Printed to the console, this expression returns:

[abc, 8]

Happy with this result, I wanted to move the code over to Guava for further development, which looked like this:

Splitter.onPattern("(?=\\d)|\\W").split("abc8")

To my surprise, the output changed to:

[abc]

Why?

Answer 1

You found a bug!

Splitter s = Splitter.onPattern("(?=\\d)|\\W");
System.out.println(s.split("abc82")); // [abc, 8]  (the trailing 2 is lost)
System.out.println(s.split("abc8"));  // [abc]     (the trailing 8 is lost)

This is the method Splitter uses to actually split the String (Splitter.SplittingIterator::computeNext):

@Override
protected String computeNext() {
  /*
   * The returned string will be from the end of the last match to the
   * beginning of the next one. nextStart is the start position of the
   * returned substring, while offset is the place to start looking for a
   * separator.
   */
  int nextStart = offset;
  while (offset != -1) {
    int start = nextStart;
    int end;

    int separatorPosition = separatorStart(offset);

    if (separatorPosition == -1) {
      end = toSplit.length();
      offset = -1;
    } else {
      end = separatorPosition;
      offset = separatorEnd(separatorPosition);
    }

    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset >= toSplit.length()) {
        offset = -1;
      }
      continue;
    }

    while (start < end && trimmer.matches(toSplit.charAt(start))) {
      start++;
    }
    while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
      end--;
    }

    if (omitEmptyStrings && start == end) {
      // Don't include the (unused) separator in next split string.
      nextStart = offset;
      continue;
    }

    if (limit == 1) {
      // The limit has been reached, return the rest of the string as the
      // final item.  This is tested after empty string removal so that
      // empty strings do not count towards the limit.
      end = toSplit.length();
      offset = -1;
      // Since we may have changed the end, we need to trim it again.
      while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
        end--;
      }
    } else {
      limit--;
    }

    return toSplit.subSequence(start, end).toString();
  }
  return endOfData();
}

The area of interest:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset >= toSplit.length()) {
    offset = -1;
  }
  continue;
}

This logic works fine unless the empty match occurs at the end of the String, that is, immediately before its last character. In that case incrementing offset makes it equal to toSplit.length(), the >= check sets it to -1, and the final character is skipped. This part should look like this (note the change from >= to >):

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset > toSplit.length()) {
    offset = -1;
  }
  continue;
}
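The effect of that one-character change can be reproduced without Guava. The following is only a simplified sketch of the loop above, not Guava's code: it drops the trimmer, limit, and omitEmptyStrings handling, and it assumes that separatorStart and separatorEnd for a pattern-based Splitter amount to Matcher.find(int), Matcher.start(), and Matcher.end(). The class name EmptyMatchDemo and the proposedFix flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Simplified re-creation of the computeNext() loop (no trimming, no limit).
    // proposedFix = false uses "offset >= length" (the check quoted above);
    // proposedFix = true uses "offset > length" (the suggested correction).
    static List<String> split(String regex, String toSplit, boolean proposedFix) {
        Matcher matcher = Pattern.compile(regex).matcher(toSplit);
        List<String> parts = new ArrayList<>();
        int offset = 0;
        while (offset != -1) {
            int nextStart = offset; // start of the next returned piece
            String piece = null;
            while (offset != -1) {
                int end;
                // Assumption: stands in for separatorStart/separatorEnd.
                if (matcher.find(offset)) {
                    end = matcher.start();
                    offset = matcher.end();
                } else {
                    end = toSplit.length();
                    offset = -1;
                }
                if (offset == nextStart) {
                    // Empty match at the start of the piece: look further right.
                    offset++;
                    boolean stop = proposedFix
                            ? offset > toSplit.length()
                            : offset >= toSplit.length();
                    if (stop) {
                        offset = -1;
                    }
                    continue;
                }
                piece = toSplit.substring(nextStart, end);
                break;
            }
            if (piece != null) {
                parts.add(piece);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        System.out.println(split(regex, "abc8", false));  // [abc]
        System.out.println(split(regex, "abc8", true));   // [abc, 8]
        System.out.println(split(regex, "abc82", false)); // [abc, 8]
        System.out.println(split(regex, "abc82", true));  // [abc, 8, 2]
    }
}
```

With the >= check, the search stops as soon as offset reaches the string length, before the last piece has been emitted. With >, one more search runs at the end of the input, finds no separator, and the remaining text is returned.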

Answer 2

Guava's Splitter appears to have a bug when the pattern produces an empty match. If you create a Matcher and print what it matches:

Pattern pattern = Pattern.compile("(?=\\d)|\\W");
Matcher matcher = pattern.matcher("abc8");
while (matcher.find()) {
    System.out.println(matcher.start() + "," + matcher.end());
}

You get the output 3,3: an empty match just before the 8. Splitter behaves as if that match had consumed the 8, so it splits there and returns only abc.
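To make the zero-width nature of that match visible, here is a small sketch that also prints the matched text. The class name ZeroWidthMatchDemo and the helper describeMatches are hypothetical, and the second input ("abc 8") is added only for contrast with a separator that does consume a character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthMatchDemo {
    // Describes each match as start,end,'text' so that empty matches stand out.
    static List<String> describeMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (matcher.find()) {
            out.add(matcher.start() + "," + matcher.end() + ",'" + matcher.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        // The lookahead matches at index 3 but consumes nothing.
        System.out.println(describeMatches(regex, "abc8"));  // [3,3,'']
        // \W consumes the space, then the lookahead matches emptily before 8.
        System.out.println(describeMatches(regex, "abc 8")); // [3,4,' ', 4,4,'']
    }
}
```

Because start and end are equal, the separator has length zero and the 8 still belongs to the text after it.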

You can use, for example, Pattern#split(String), which appears to give the correct result:

Pattern.compile("(?=\\d)|\\W").split("abc8")
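As a self-contained illustration of this workaround, the sketch below compiles the pattern once and reuses it, using only java.util.regex. The class name PatternSplitDemo and the helper splitWithPattern are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternSplitDemo {
    // Compiled once so repeated splits do not recompile the regex.
    private static final Pattern SEPARATOR = Pattern.compile("(?=\\d)|\\W");

    static List<String> splitWithPattern(String input) {
        return Arrays.asList(SEPARATOR.split(input));
    }

    public static void main(String[] args) {
        System.out.println(splitWithPattern("abc8"));  // [abc, 8]
        System.out.println(splitWithPattern("abc82")); // [abc, 8, 2]
        // String.split with the same regex gives the same result.
        System.out.println(Arrays.asList("abc8".split("(?=\\d)|\\W"))); // [abc, 8]
    }
}
```

Both JDK methods keep the text after a trailing empty match, which is the behavior the question expected from Splitter.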