Масштабирование и привязка к логарифмически нормальному распределению с использованием логарифмической оси в python

У меня есть логарифмически нормальный распределенный набор образцов. Я могу визуализировать образцы, используя гистограмму с линейной или логарифмической осью х. Я могу выполнить привязку к гистограмме, чтобы получить PDF, а затем масштабировать ее до гистограммы на графике с линейной осью х, см. Также этот ранее заданный вопрос.

Я, однако, не смог правильно построить PDF в графике с логарифмической осью x.

К сожалению, это не только проблема с масштабированием области PDF в гистограмме, но и PDF файл сдвигается влево, как видно из следующего графика.

Теперь мой вопрос: что я здесь делаю неправильно? Используя CDF для построения ожидаемой гистограммы, как предложено в этом ответе, работает. Я просто хотел бы знать, что я делаю неправильно в этом коде, так как в моем понимании он тоже должен работать.

Это код python (мне жаль, что он довольно длинный, но я хотел опубликовать "полную автономную версию" ):

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# generate log-normal distributed set of samples
np.random.seed(42)
samples   = np.random.lognormal( mean=1, sigma=.4, size=10000 )

# make a fit to the samples
shape, loc, scale = scipy.stats.lognorm.fit( samples, floc=0 )
x_fit       = np.linspace( samples.min(), samples.max(), 100 )
samples_fit = scipy.stats.lognorm.pdf( x_fit, shape, loc=loc, scale=scale )

# plot a histrogram with linear x-axis
plt.subplot( 1, 2, 1 )
N_bins = 50
counts, bin_edges, ignored = plt.hist( samples, N_bins, histtype='stepfilled', label='histogram' )
# calculate area of histogram (area under PDF should be 1)
area_hist = .0
for ii in range( counts.size):
    area_hist += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# oplot fit into histogram
plt.plot( x_fit, samples_fit*area_hist, label='fitted and area-scaled PDF', linewidth=2)
plt.legend()

# make a histrogram with a log10 x-axis
plt.subplot( 1, 2, 2 )
# equally sized bins (in log10-scale)
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate area of histogram
area_hist_log = .0
for ii in range( counts.size):
    area_hist_log += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# get pdf-values for log10 - spaced intervals
x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# oplot fit into histogram
plt.plot( x_fit_log, samples_fit_log*area_hist_log, label='fitted and area-scaled PDF', linewidth=2 )

plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend()
plt.show()

Обновление 1:

Я забыл упомянуть о версиях, которые я использую:

python      2.7.6
numpy       1.8.2
matplotlib  1.3.1
scipy       0.13.3

Обновление 2:

Как указано @Christoph и @zaxliu (благодаря обоим), проблема заключается в масштабировании PDF. Он работает, когда я использую те же ячейки, что и для гистограммы, как в решении @zaxliu, но у меня все еще есть некоторые проблемы при использовании более высокого разрешения для PDF (как в моем примере выше). Это показано на следующем рисунке:

Код для фигуры с правой стороны (я не учитывал материал генерации импорта и данных, который вы можете найти в приведенном выше примере):

# equally sized bins in log10-scale
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )

# calculate length of each bin (required for scaling PDF to histogram)
bins_log_len = np.zeros( bins_log10.size )
for ii in range( counts.size):
    bins_log_len[ii] = bin_edges[ii+1]-bin_edges[ii]

# get pdf-values for same intervals as histogram
samples_fit_log = scipy.stats.lognorm.pdf( bins_log10, shape, loc=loc, scale=scale )

# oplot fitted and scaled PDF into histogram
plt.plot( bins_log10, np.multiply(samples_fit_log,bins_log_len)*sum(counts), label='PDF using histogram bins', linewidth=2 )

# make another pdf with a finer resolution
x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# calculate length of each bin (required for scaling PDF to histogram)
# in addition, estimate middle point for more accuracy (should in principle also be done for the other PDF)
bins_log_len       = np.diff( x_fit_log )
samples_log_center = np.zeros( x_fit_log.size-1 )
for ii in range( x_fit_log.size-1 ):
    samples_log_center[ii] = .5*(samples_fit_log[ii] + samples_fit_log[ii+1] )

# scale PDF to histogram
# NOTE: THIS IS NOT WORKING PROPERLY (SEE FIGURE)
pdf_scaled2hist = np.multiply(samples_log_center,bins_log_len)*sum(counts)

# oplot fit into histogram
plt.plot( .5*(x_fit_log[:-1]+x_fit_log[1:]), pdf_scaled2hist, label='PDF using own bins', linewidth=2 )

plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend(loc=3)

Ответ 1

Из того, что я понял в исходном ответе @Warren Weckesser, что вы отнесены к, "все, что вам нужно сделать", это:

напишите приближение cdf(b) - cdf(a) как cdf(b) - cdf(a) = pdf(m)*(b - a)где m -, скажем, середина интервала [a, b]

Мы можем попытаться выполнить его рекомендацию и построить два способа получения pdf-значений на основе центральных точек бункеров:

с функцией PDF
с функцией CDF:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats 

# generate log-normal distributed set of samples
np.random.seed(42)
samples = np.random.lognormal(mean=1, sigma=.4, size=10000)
N_bins = 50

# make a fit to the samples
shape, loc, scale = stats.lognorm.fit(samples, floc=0)
x_fit       = np.linspace(samples.min(), samples.max(), 100)
samples_fit = stats.lognorm.pdf(x_fit, shape, loc=loc, scale=scale)

# plot a histrogram with linear x-axis
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,5), gridspec_kw={'wspace':0.2})
counts, bin_edges, ignored = ax1.hist(samples, N_bins, histtype='stepfilled', alpha=0.4,
                                      label='histogram')

# calculate area of histogram (area under PDF should be 1)
area_hist = ((bin_edges[1:] - bin_edges[:-1]) * counts).sum()

# plot fit into histogram
ax1.plot(x_fit, samples_fit*area_hist, label='fitted and area-scaled PDF', linewidth=2)
ax1.legend()

# equally sized bins in log10-scale and centers
bins_log10 = np.logspace(np.log10(samples.min()), np.log10(samples.max()), N_bins)
bins_log10_cntr = (bins_log10[1:] + bins_log10[:-1]) / 2

# histogram plot
counts, bin_edges, ignored = ax2.hist(samples, bins_log10, histtype='stepfilled', alpha=0.4,
                                      label='histogram')

# calculate length of each bin and its centers(required for scaling PDF to histogram)
bins_log_len = np.r_[bin_edges[1:] - bin_edges[: -1], 0]
bins_log_cntr = bin_edges[1:] - bin_edges[:-1]

# get pdf-values for same intervals as histogram
samples_fit_log = stats.lognorm.pdf(bins_log10, shape, loc=loc, scale=scale)

# pdf-values for centered scale
samples_fit_log_cntr = stats.lognorm.pdf(bins_log10_cntr, shape, loc=loc, scale=scale)

# pdf-values using cdf 
samples_fit_log_cntr2_ = stats.lognorm.cdf(bins_log10, shape, loc=loc, scale=scale)
samples_fit_log_cntr2 = np.diff(samples_fit_log_cntr2_)

# plot fitted and scaled PDFs into histogram
ax2.plot(bins_log10, 
         samples_fit_log * bins_log_len * counts.sum(), '-', 
         label='PDF with edges',  linewidth=2)

ax2.plot(bins_log10_cntr, 
         samples_fit_log_cntr * bins_log_cntr * counts.sum(), '-', 
         label='PDF with centers', linewidth=2)

ax2.plot(bins_log10_cntr, 
         samples_fit_log_cntr2 * counts.sum(), 'b-.', 
         label='CDF with centers', linewidth=2)


ax2.set_xscale('log')
ax2.set_xlim(bin_edges.min(), bin_edges.max())
ax2.legend(loc=3)
plt.show()

Вы можете видеть, что как первые (используя pdf), так и второй (с использованием cdf) методы дают почти одинаковые результаты, и оба точно не соответствуют PDF, рассчитанным с краями бункеров.

Если вы увеличите масштаб, вы ясно увидите разницу:

Теперь можно задать вопрос: какой из них использовать? Я думаю, ответ будет зависеть, но если мы посмотрим на кумулятивные вероятности:

print 'Cumulative probabilities:'
print 'Using edges:         {:>10.5f}'.format((samples_fit_log * bins_log_len).sum())
print 'Using PDF of centers:{:>10.5f}'.format((samples_fit_log_cntr * bins_log_cntr).sum())
print 'Using CDF of centers:{:>10.5f}'.format(samples_fit_log_cntr2.sum())

Вы можете увидеть, какой метод ближе к выходу 1.0:

Cumulative probabilities:
Using edges:            1.03263
Using PDF of centers:   0.99957
Using CDF of centers:   0.99991

CDF, кажется, дает самое близкое приближение.

Это был длинный, но я надеюсь, что это имеет смысл.

Update:

Я скорректировал код, чтобы проиллюстрировать, как можно сгладить линию PDF. Обратите внимание на переменную s, которая определяет, насколько гладкой будет линия. Я добавил суффикс _s к переменным, чтобы указать, где должны выполняться корректировки.

# generate log-normal distributed set of samples
np.random.seed(42)
samples = np.random.lognormal(mean=1, sigma=.4, size=10000)
N_bins = 50

# make a fit to the samples
shape, loc, scale = stats.lognorm.fit(samples, floc=0)

# plot a histrogram with linear x-axis
fig, ax2 = plt.subplots()#1,2, figsize=(10,5), gridspec_kw={'wspace':0.2})

# equally sized bins in log10-scale and centers
bins_log10 = np.logspace(np.log10(samples.min()), np.log10(samples.max()), N_bins)
bins_log10_cntr = (bins_log10[1:] + bins_log10[:-1]) / 2

# smoother PDF line
s = 10 # mulpiplier to N_bins - the bigger s is the smoother the line
bins_log10_s = np.logspace(np.log10(samples.min()), np.log10(samples.max()), N_bins * s)
bins_log10_cntr_s = (bins_log10_s[1:] + bins_log10_s[:-1]) / 2

# histogram plot
counts, bin_edges, ignored = ax2.hist(samples, bins_log10, histtype='stepfilled', alpha=0.4,
                                      label='histogram')

# calculate length of each bin and its centers(required for scaling PDF to histogram)
bins_log_len = np.r_[bins_log10_s[1:] - bins_log10_s[: -1], 0]
bins_log_cntr = bins_log10_s[1:] - bins_log10_s[:-1]

# smooth pdf-values for same intervals as histogram
samples_fit_log_s = stats.lognorm.pdf(bins_log10_s, shape, loc=loc, scale=scale)

# pdf-values for centered scale
samples_fit_log_cntr = stats.lognorm.pdf(bins_log10_cntr_s, shape, loc=loc, scale=scale)

# smooth pdf-values using cdf 
samples_fit_log_cntr2_s_ = stats.lognorm.cdf(bins_log10_s, shape, loc=loc, scale=scale)
samples_fit_log_cntr2_s = np.diff(samples_fit_log_cntr2_s_)

# plot fitted and scaled PDFs into histogram
ax2.plot(bins_log10_cntr_s, 
         samples_fit_log_cntr * bins_log_cntr * counts.sum() * s, '-', 
         label='Smooth PDF with centers', linewidth=2)

ax2.plot(bins_log10_cntr_s, 
         samples_fit_log_cntr2_s * counts.sum() * s, 'k-.', 
         label='Smooth CDF with centers', linewidth=2)

ax2.set_xscale('log')
ax2.set_xlim(bin_edges.min(), bin_edges.max())
ax2.legend(loc=3)
plt.show)

Это создает этот график:

Если вы увеличите размер сглаженной версии и не сглаживаете, вы увидите следующее:

Надеюсь, что это поможет.

Ответ 2

Поскольку я столкнулся с той же проблемой и понял ее, я хотел объяснить, что происходит, и предоставить другое решение для исходного вопроса.

Когда вы делаете гистограмму с логарифмическими ячейками, это эквивалентно изменению переменных , где x - ваши исходные образцы (или сетка, которую вы используете для их построения), а t - новая переменная, относительно которой ячейки расположены линейно. Поэтому PDF, который фактически соответствует гистограмме,

Мы все еще работаем с переменными x как входы в PDF, так что это становится

Вам нужно умножить свой PDF на x!

Это фиксирует форму PDF, но нам еще нужно масштабировать PDF так, чтобы область под кривой была равна гистограмме. Фактически область под PDF не равна единице, так как мы интегрируем по x и

так как мы имеем дело с логнормальной переменной. Поскольку, согласно scipy documentation, параметры распределения соответствуют shape = sigma и scale = exp(mu), мы можем легко вычислить правую часть в ваш код scale * np.exp(shape**2/2.).

Фактически, одна строка кода исправляет ваш исходный script, умножая вычисленные значения PDF на x и деля по вычисленной выше области:

samples_fit_log *= x_fit_log / (scale * np.exp(shape**2/2.))

Результат на следующем рисунке:

В качестве альтернативы вы можете изменить определение гистограммы "area", объединив гистограмму в пространстве журналов. Помните, что в лог-пространстве (переменная t) PDF имеет область 1. Таким образом, вы можете пропустить коэффициент масштабирования и заменить строку выше:

area_hist_log = np.dot(np.diff(np.log(bin_edges)), counts)
samples_fit_log *= x_fit_log

Это последнее решение может быть предпочтительным, поскольку оно не зависит от какой-либо информации о распределении. Это применимо к любому распределению, а не только к log-normal.

Для справки, вот оригинал script с добавленной линией:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# generate log-normal distributed set of samples
np.random.seed(42)
samples   = np.random.lognormal( mean=1, sigma=.4, size=10000 )

# make a fit to the samples
shape, loc, scale = scipy.stats.lognorm.fit( samples, floc=0 )
x_fit       = np.linspace( samples.min(), samples.max(), 100 )
samples_fit = scipy.stats.lognorm.pdf( x_fit, shape, loc=loc, scale=scale )

# plot a histrogram with linear x-axis
plt.subplot( 1, 2, 1 )
N_bins = 50
counts, bin_edges, ignored = plt.hist( samples, N_bins, histtype='stepfilled', label='histogram' )
# calculate area of histogram (area under PDF should be 1)
area_hist = .0
for ii in range( counts.size):
    area_hist += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# oplot fit into histogram
plt.plot( x_fit, samples_fit*area_hist, label='fitted and area-scaled PDF', linewidth=2)
plt.legend()

# make a histrogram with a log10 x-axis
plt.subplot( 1, 2, 2 )
# equally sized bins (in log10-scale)
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate area of histogram
area_hist_log = .0
for ii in range( counts.size):
    area_hist_log += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# get pdf-values for log10 - spaced intervals
x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# scale pdf output:
samples_fit_log *= x_fit_log / (scale * np.exp(shape**2/2.))
# alternatively you could do:
#area_hist_log = np.dot(np.diff(np.log(bin_edges)), counts)
#samples_fit_log *= x_fit_log

# oplot fit into histogram
plt.plot( x_fit_log, samples_fit_log*area_hist_log, label='fitted and area-scaled PDF', linewidth=2 )

plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend()
plt.show()

Ответ 3

Как указано в @Christoph, проблема заключается в том, как вы масштабируете выборочный pdf.

Поскольку pdf - плотность плотности вероятности, если вы хотите ожидаемую частоту в бункере, вы должны сначала увеличить плотность на длину бункера, чтобы получить приблизительную вероятность того, что образец упадет в этом бункере, тогда вы можете умножить эта вероятность на общее число выборок для оценки количества выборок, которые будут попадать в этот ящик.

Другими словами, каждый лоток должен масштабироваться неравномерно в логарифмическом масштабе, тогда как вы масштабируете их равномерно с помощью "области под гистойкой". В качестве исправления вы можете сделать следующее:

# make a histrogram with a log10 x-axis
plt.subplot( 1, 2, 2 )
# equally sized bins (in log10-scale)
bins_log10 = np.logspace( np.log10( samples.min()  ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate length of each bin
len_bin_log = np.zeros([bins_log10.size,])
for ii in range( counts.size):
    len_bin_log[ii] = (bin_edges[ii+1]-bin_edges[ii])

# get pdf-values for log10 - spaced intervals
# x_fit_log       = np.logspace( np.log10( samples.min()), np.log10( samples.max()), N_bins )
samples_fit_log = scipy.stats.lognorm.pdf( bins_log10, shape, loc=loc, scale=scale )

# oplot fit into histogram
plt.plot(bins_log10 , np.multiply(samples_fit_log,len_bin_log)*sum(counts), label='fitted and area-scaled PDF', linewidth=2 )
plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
# plt.legend()
plt.show()

Также вы можете также рассмотреть возможность изменения метода масштабирования в линейном масштабе аналогичным образом. На самом деле вам не нужно накапливать область, просто несколько плотность по размеру бункера, а затем по общему количеству образцов.

Update

Мне приходило в голову, что мой нынешний подход для оценки вероятности в ящиках может быть не самым точным. Оценка с образцом в средней точке может быть более точной, поскольку кривые pdf являются вогнутыми.