Python: split a string without splitting on an escaped character

Is there a way to split a string without splitting on an escaped character? For example, I have the string below and want to split it on ":" but not on "\:"

http\://www.example.url:ftp\://www.example.url

The result should be the following:

['http\://www.example.url' , 'ftp\://www.example.url']

Answer 1

Note that : does not appear to be a character that needs escaping.

The simplest way I can think of to do this is to split on the character and then add it back wherever it was escaped.

Sample code (it could use quite a bit of tidying up):

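def splitNoEscapes(string, char):
    sections = string.split(char)
    # Re-attach the delimiter to every section that ended with a backslash.
    # endswith() avoids an IndexError on empty sections (e.g. "a::b").
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # Only move on to the next slot when this section was not escaped.
        j += (0 if s.endswith(char) else 1)
    # Note: this also drops empty fields.
    return [i for i in result if i != ""]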

Answer 2

There is a much easier way: use a regular expression with a negative lookbehind assertion:

re.split(r'(?<!\\):', str)
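As a minimal sketch of how this behaves on the string from the question (the names UNESCAPED_COLON, s, and parts are illustrative and not part of the answer), the pattern splits only on colons that are not immediately preceded by a backslash and leaves the backslashes in place:

```python
import re

# Split on ":" unless the character right before it is a backslash.
UNESCAPED_COLON = re.compile(r'(?<!\\):')

s = r'http\://www.example.url:ftp\://www.example.url'
parts = UNESCAPED_COLON.split(s)
print(parts)

# A string without any escapes is split like a plain str.split(":").
plain = UNESCAPED_COLON.split('a:b:c')
print(plain)
```

The escape characters are kept in the output, so a separate unescaping step would be needed if the raw values are wanted.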

Answer 3

As Ignacio says, yes, but not trivially in one go. The problem is that you need to look back to determine whether you are at an escaped delimiter or not, and the basic str.split does not provide that functionality.

If this is not inside a tight loop, so that performance is not a significant concern, you can do it by first splitting on the escaped delimiters, then splitting on the plain delimiter, and then merging the pieces. Ugly demonstration code:

# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    # Start with the "real" splits of the first section
    ret = list(sections[0])
    for parts in sections[1:]:
        # An escaped delimiter sat between the previous section and this one,
        # so glue the last item collected so far to this section's first item
        ret[-1] = escaped_delim.join([ret[-1], parts[0]])
        # Everything after the first item is a separate field
        ret.extend(parts[1:])
    return ret

s = r'http\://www.example.url:ftp\://www.example.url'
print (escaped_split(s, ':')) 
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']

Alternatively, the logic may be easier to follow if you just split the string by hand.

def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

Note that this second version behaves slightly differently when it encounters a double escape followed by a delimiter: this function allows escape characters themselves to be escaped, so escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that is something to watch out for.
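The difference can be checked directly. The comparison below is only a sketch: split_lookbehind and split_pairs are hypothetical names, the first standing in for the approach that treats any delimiter preceded by a backslash as escaped, the second for a compact variant of the hand-written loop above.

```python
import re

def split_lookbehind(s, delim):
    # A delimiter counts as escaped whenever a backslash directly precedes it.
    return re.split(r'(?<!\\)' + re.escape(delim), s)

def split_pairs(s, delim):
    # A backslash consumes the next character, so two backslashes in a row
    # form an escaped backslash and do not protect a following delimiter.
    parts, current = [], []
    chars = iter(s)
    for ch in chars:
        if ch == '\\':
            current.append(ch)
            current.append(next(chars, ''))  # '' if the string ends on a backslash
        elif ch == delim:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

s = r'a\\:b'  # escaped backslash followed by a real delimiter
print(split_lookbehind(s, ':'))  # one field: the colon is treated as escaped
print(split_pairs(s, ':'))       # two fields: the colon is a real delimiter
```

Both functions agree on inputs with single escapes such as r'a\:b'; they only diverge when the backslash before the delimiter is itself escaped.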

Answer 4

An edited version of Henry's answer with Python 3 compatibility, tests, and fixes for some issues:

def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

Answer 5

Here is an efficient solution that handles double escapes correctly, i.e. a delimiter that follows a double escape is not treated as escaped. It ignores a stray single escape character at the very end of the string.

It is efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of building a list, it returns a generator.

def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]

To allow delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter rather than a single character.

Tested with Python 3.5.1.
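The multi-character variant described above could look roughly like the following. This is a sketch, not the author's code: split_esc_multi is a hypothetical name, it has not been tested beyond the cases shown, and it assumes the delimiter itself contains no backslash.

```python
def split_esc_multi(string, delimiter):
    # Assumption: the delimiter is non-empty and contains no backslash.
    if not delimiter:
        raise ValueError('Delimiter must not be empty')
    ln = len(string)
    dl = len(delimiter)
    i = 0  # start of the current field
    j = 0  # scan position
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                # Stray escape at the end of the string: drop it.
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                j += 1 + dl  # one escape protects the whole delimiter
            else:
                j += 2       # escape plus the single escaped character
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += dl
            i = j
        else:
            j += 1
    yield string[i:]

print(list(split_esc_multi(r'a\::b::c', '::')))
```

As in the single-character version, the escape characters are left in the yielded fields.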

Answer 6

I think simple C-style parsing would be much simpler and more robust.

def escaped_split(s, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False  # True when the previous character was an unescaped backslash
    for c in s:
        if not escape and c == ch:
            out.append(part)
            part = ''
        else:
            part += c
            escape = not escape and c == '\\'
    if part:  # note: a trailing empty field is dropped
        out.append(part)
    return out

Answer 7

There is no built-in function for this. Here is an efficient, general, and tested function that even supports delimiters of any length:

def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim

Answer 8

I created this method, which is inspired by Henry Keiter's answer but has the following advantages:

Configurable escape character and delimiter
The escape character is not removed if it is not actually escaping anything.

This is the code:

def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result

This is the test code showing the behavior:

def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))

    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))

    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))

    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))

Answer 9

Building on @user629923's suggestion, but much simpler than the other answers:

import re
DBL_ESC = "!double escape!"

s = r"Hello:World\:Goodbye\\:Cruel\\\:World"

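# Hide escaped backslashes behind a placeholder, split on unescaped ":", then restore them.
# A list comprehension is used because map() returns a lazy iterator on Python 3.
parts = [x.replace(DBL_ESC, r'\\') for x in re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))]
print(parts)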