Skip to main content

Accuracy of field statistics in IDEA?

Dear Brian,
my question is about the specific accuracy of the field statistics (sum, numeric field) being calculated and shown in IDEA.
 
Suppose I have a large sample dataset with two columns, one name and the other some value. I create this sample dataset using Python:
import csv
import pandas as pd
counter=range(0, 5000000)
with open(r'C:\myfile\Test.csv', "w", newline='', encoding="utf-8") as fo:
    cwriter = csv.writer(fo, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cwriter.writerow(['ColumnName','Value'])
    for i in counter:
        cwriter.writerow(['A',500.01])
    cwriter.writerow(['B',500.49])
    cwriter.writerow(['C',1000.05])
df=pd.read_csv(r'C:\myfile\Test.csv', sep=";", encoding="utf-8")
 
print(df['Value'].sum())
 
This generates a csv with 5 million records with each having a value of 500.01. Then one line is additionally added with the value of 500.49 and one with 1000.05. So in total this would be 5000000*500,01+500,49+500,05 which gives 2500051000.54. And indeed the output of print(df['Value'].sum()) in Python gives me 2500051500.54. And if I set the options of decimals to maximum I can see 2500051500.5400214. So there is some calculation inaccuracy.
 
When I now import this csv into IDEA, specifying in the import wizard that "." is the decimal it works and I can see the dataset. Then I click on field statistics and the field statistics are being calculated. They show the number 2.500.051.500,66. So there is some deviation. I was a bit surprised when I saw this deviation. The inaccuracy in the last two digits (the ,66) is not correct. ,54 is the true value. So what is the issue here? Did I do something wrong or is this just the "typical" compuation inaccuracy due to the large volume of data?
I am using IDEA Client Version: 11.2.2 on Windows 10 Enterprise LTSC x64.
 

Brian Element Fri, 02/17/2023 - 14:05

This is a "typical" because of the computation inaccuracy.  In IDEA I created a running total.  You can see it starts to diverge at record 6593.

By the end this divergence gives you the value that you found.

Brian Element Fri, 02/17/2023 - 14:09

The problem is the decimals.  If you want to try an experiment create a virtual field and multiple your original value by 100 to remove the decimals then do the field stats and you will see you get the correct sum.