Найти indexOf массив байтов в другом массиве байтов

Учитывая массив байтов, как я могу найти в нем положение (меньшего) байтового массива?

Эта документация выглядела многообещающей, используя ArrayUtils, но если я прав, это позволит мне найти отдельный байт в массиве, который нужно искать.

(Я не вижу в этом никакого значения, но на всякий случай: иногда массив байтов поиска будет регулярным символом ASCII, иначе он будет управляющими символами или расширенными символами ASCII, поэтому использование операций со строками не всегда будет подходящим)

Большой массив может составлять от 10 до 10000 байт, а меньший - около 10. В некоторых случаях у меня будет несколько меньших массивов, которые я хочу найти в более крупном массиве в одном поиске. И я иногда захочу найти последний индекс экземпляра, а не первый.

Ответ 1

Строки Java состоят из 16-разрядных char s, а не 8-разрядных byte s. A char может содержать byte, поэтому вы всегда можете сделать свои байтовые массивы в строки и использовать indexOf: символы ASCII, управляющие символы и даже нулевые символы будут работать нормально.

Вот демо:

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));

Отпечатает 7.

Однако, учитывая, что ваш большой массив может составлять до 10 000 байт, а малый массив - всего десять байт, это решение может быть не самым эффективным по двум причинам:

Требуется скопировать большой массив в массив, который в два раза больше (такая же емкость, но с char вместо byte). Это увеличивает ваши потребности в памяти.
Строковый алгоритм поиска Java не самый быстрый из доступных. Вы можете получить достаточно быстро, если вы реализуете один из продвинутых алгоритмов, например, Knuth-Morris-Pratt. Это может привести к снижению скорости выполнения в десять раз (длина маленькой строки) и потребует дополнительной памяти, пропорциональной длине маленькой строки, а не большой строке.

Ответ 2

Симпольным способом было бы сравнить каждый элемент:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}

Некоторые тесты:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}

По мере обновления вашего вопроса: строки Java - это строки UTF-16, они не заботятся о расширенном наборе ASCII, поэтому вы можете использовать string.indexOf()

Ответ 3

Google Guava предоставляет Bytes.indexOf(байт [] массив, байт [] target).

Ответ 4

Это то, что вы ищете?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

Ответ 5

Чтобы сэкономить время на тестировании:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

дает код, который работает, если вы выполняете computeFailure() static:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
    int[] failure = computeFailure(pattern);

    int j = 0;

    for (int i = 0; i < data.length; i++) {
        while (j > 0 && pattern[j] != data[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == data[i]) { 
            j++; 
        }
        if (j == pattern.length) {
            return i - pattern.length + 1;
        }
    }
    return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
    int[] failure = new int[pattern.length];

    int j = 0;
    for (int i = 1; i < pattern.length; i++) {
        while (j>0 && pattern[j] != pattern[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == pattern[i]) {
            j++;
        }
        failure[i] = j;
    }

    return failure;
    }
}

Поскольку всегда полезно проверить код, который вы заимствуете, вы можете начать с:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}

Ответ 6

Использование Knuth–Morris–Pratt algorithm является наиболее эффективным способом.

StreamSearcher.java является его реализацией и является частью проекта Twitter elephant-bird.

Не рекомендуется включать эту библиотеку, так как она довольно полезна для использования только одного класса.

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice, Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Moore-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

Ответ 7

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
 }

поэтому это приводит к

indexList=[6, 25]

Итак, u может найти индекс байта [] в байте []

Пример здесь на Github at: https://github.com/riversun/finbin

Ответ 8

Скопировано почти идентично с java.lang.String.

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

Ответ 9

Для небольшого HTTP-сервера, над которым я сейчас работаю, я нашел следующий код, чтобы найти границы в запросе multipart/form-data. Надеялся найти лучшее решение здесь, но я думаю, я буду придерживаться его. Я думаю, что он так же эффективен, как и может (довольно быстро и использует не много барана). Он использует входные байты в качестве кольцевого буфера, считывает следующий байт, как только он не соответствует границе, и записывает данные после первого полного цикла в выходной поток. Конечно, он может быть изменен для массивов байтов, а не потоков, как задано в вопросе.

    private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
    {
        try
        {
            String n = "--"+boundary;
            byte[] bc = n.getBytes("UTF-8");
            int s = bc.length;
            byte[] b = new byte[s];
            int p = 0;
            long l = 0;
            int c;
            boolean r;
            while ((c = is.read()) != -1)
            {
                b[p] = (byte) c;
                l += 1;
                p = (int) (l % s);
                if (l>p)
                {
                    r = true;
                    for (int i = 0; i < s; i++)
                    {
                        if (b[(p + i) % s] != bc[i])
                        {
                            r = false;
                            break;
                        }
                    }
                    if (r)
                        break;
                    os.write(b[p]);
                }
            }
            os.flush();
            return true;
        } catch(IOException e) {e.printStackTrace();}
        return false;
    }