Silviu-Marius Ardelean's blog

a software engineer's web log

File size fast detection

In our job we often need to work with files and to know their properties.
One of the most important properties is the file size. Of course, plenty of APIs can report it, but most of them need additional file operations: open the file, query the size, and close the file.

A direct and fast way to get the file size without these operations is the CRT run-time library’s function _tstat64() and its relatives.

In a header or .cpp file, add the following macro definitions:

Then write the following function:
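A minimal sketch of what a stat-based size query can look like. This is an illustration under stated assumptions, not the article's original listing: the names FileSizeByStat and PrintFileSize are hypothetical, plain char paths are used instead of TCHAR, and the non-Windows branch exists only so the sketch builds elsewhere. It returns -1 on failure, the convention discussed in the comments.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <sys/types.h>
#include <sys/stat.h>

// Hypothetical helper: returns the size in bytes of the file at `path`,
// or -1 if the path is null/empty or the file cannot be queried.
// Only the path is passed; there is no explicit open/close pair.
std::int64_t FileSizeByStat(const char* path)
{
    if (path == nullptr || path[0] == '\0')
        return -1;

#ifdef _WIN32
    struct _stat64 info;              // 64-bit st_size, so sizes above 2 GB fit
    if (_stat64(path, &info) != 0)
        return -1;
#else
    struct stat info;                 // POSIX fallback (assumption for portability)
    if (stat(path, &info) != 0)
        return -1;
#endif
    return static_cast<std::int64_t>(info.st_size);
}

// Usage example: print the size, or a message if it cannot be determined.
void PrintFileSize(const char* path)
{
    assert(path != nullptr);          // printing a null string would be a bug
    const std::int64_t size = FileSizeByStat(path);
    if (size < 0)
        std::printf("%s: cannot get size\n", path);
    else
        std::printf("%s: %lld bytes\n", path, static_cast<long long>(size));
}
```

A signed 64-bit return type keeps -1 free as an error marker, because a real size is never negative.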

If you’re using the WinAPI, there is an even faster way to get the file size.

Finally, call these functions wherever you need.

Silviu Ardelean

Software Engineer


23 Responses to “File size fast detection”


  1. Marius

    Or, you can use _tstat and _tstat64 and then you don’t have to make any change for ANSI or UNICODE.


  2. Sorin

    OK, but because of the limit of the long type, your file must be smaller than 2 GB.

  3. Absolutely. Nothing stops you from replacing long with long long or unsigned long.

  4. Somebody wondered why I return -1 from a function whose return type is unsigned long.

    Because even then I can easily test the validity of the returned value:
    unsigned long size = GetFileSize(…);
    if(size == (unsigned long)-1) …


  5. Mihai

    What do you do if the file size is 4294967295 ?

  6. Even in that exquisite case, I see no issue, even on a Win32 machine. Execution follows the true branch.
    English next time, please.


  7. George

    Why do you assign the result to an unsigned long in that code, when the function returns long long? And what’s the point of comparing to (unsigned long)-1?

  8. @George: Please read my comments. I’m talking about the same function, but as if it returned unsigned long instead of long long.
    (unsigned long)-1 is equal to 4294967295 (0xFFFFFFFF).
    The Windows SDK defines: #define INVALID_FILE_SIZE ((DWORD)0xFFFFFFFF)


  9. Dan

    How much faster is this compared to GetFileSize?
    Any numbers?

  10. @Dan: You are missing a few things.
    I’m not talking about WinAPI functions that need a handle (“the CRT run-time library’s function“). Please check the article’s tags.
    I just want to pass the path, without any additional calls such as fopen()+fclose() or CreateFile()+CloseHandle().
    Even if I were talking about GetFileSize(), that function returns a DWORD, which is an unsigned long, too.
    Maybe you mean GetFileSizeEx().


  11. Mihai

    And how do you know if 4294967295 is a correct file size or an invalid file size?
    unsigned long size = GetFileSize(…);
    if(size == (unsigned long)-1)
    {
    //now what? // silviu – invalid file size
    }

  12. This code is just a sample meant to give you an idea.
    A friend said: “the help files, books or FAQs are not perfect. It doesn’t fit everywhere and are not designed for simple copy/paste without using any neurons”.

    In that hypothetical case you would get an invalid file size as the result of GetFileSize(), too. That was the reason for changing from unsigned long to long long.
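The ambiguity debated above fits in a few lines. This is only a sketch: fixed-width types stand in for the 32-bit unsigned long and the 64-bit long long of a Windows build, and the names ReportSize32 and ReportSize64 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit variant: the error marker is (uint32_t)-1 == 0xFFFFFFFF,
// which is also the legitimate size of a file of exactly 4294967295 bytes.
const std::uint32_t kInvalidSize32 = static_cast<std::uint32_t>(-1);

// Hypothetical reporting helpers; `ok` says whether the lookup succeeded.
std::uint32_t ReportSize32(bool ok, std::uint64_t realSize)
{
    return ok ? static_cast<std::uint32_t>(realSize) : kInvalidSize32;
}

// 64-bit signed variant: a real size is never negative, so -1 is unambiguous.
std::int64_t ReportSize64(bool ok, std::uint64_t realSize)
{
    assert(realSize <= static_cast<std::uint64_t>(INT64_MAX));
    return ok ? static_cast<std::int64_t>(realSize) : -1;
}
```

With the 32-bit variant, a caller cannot tell a 4294967295-byte file from a failure; with the signed 64-bit variant the two cases produce different values.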


  13. Dan

    Re GetFileSize:
    From MSDN: GetFileSize has this signature, which lets you get the full 64 bits of the size (see the lpFileSizeHigh argument).

    DWORD WINAPI GetFileSize(
        __in HANDLE hFile,
        __out_opt LPDWORD lpFileSizeHigh
    );

    Using GetFileSizeEx is not mandatory; it is recommended so that you avoid ambiguous results when NULL is passed for lpFileSizeHigh.

    Re Fast:
    Anyway, my question was whether you have tested how fast the various approaches really are (open/tell/close, stat, GetFileSize).

    Not surprisingly, a simple test of 100000 file size reads with 3 methods shows that stat is actually not faster than fopen/ftell/fclose:

    Runtime with open/close: 2823 ms
    Runtime with stat: 3261 ms
    Runtime with GetFileSize: 62 ms

    I’m talking about runtimes here, and I really hope that’s what you meant when you said fast.

    ——
    Code here:

    #include "stdafx.h"
    #include <windows.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define TEST_FILE ("c:\\P5K-1201.ROM")

    int main(int argc, char * argv[])
    {
        // Measure runtime with fopen/ftell/fclose
        int startTime = GetTickCount();
        for (int i = 0; i < 100000; i++)
        {
            INT64 fileSize;

            FILE * fileHandle = fopen(TEST_FILE, "rb");
            if (fileHandle != NULL)
            {
                fseek(fileHandle, 0, SEEK_END);
                fileSize = _ftelli64(fileHandle);
                fclose(fileHandle);
            }
        }

        int runTime = GetTickCount() - startTime;
        printf("Runtime with open/close: %d ms\n", runTime);

        // Measure runtime with Stat
        startTime = GetTickCount();
        for (int i = 0; i < 100000; i++)
        {
            struct _stat64 fileStat;
            INT64 fileSize;
            if (_stat64(TEST_FILE, &fileStat) == 0)
            {
                fileSize = fileStat.st_size;
            }
        }
        runTime = GetTickCount() - startTime;
        printf("Runtime with stat: %d ms\n", runTime);

        // Measure runtime with GetFileSize
        startTime = GetTickCount();
        for (int i = 0; i < 100000; i++)
        {
            DWORD sizeLow;
            DWORD sizeHigh;
            INT64 fileSize;

            sizeLow = GetFileSize(TEST_FILE, &sizeHigh);
            fileSize = sizeLow + (sizeHigh << sizeof(DWORD));
        }
        runTime = GetTickCount() - startTime;
        printf("Runtime with GetFileSize: %d ms", runTime);

        return 0;
    }

    Btw, CRT stands for C Runtime Library, no need to say CRT runtime.
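Joining the two 32-bit halves that GetFileSize() reports (the return value and lpFileSizeHigh) needs a widening cast and a shift by the bit width. A sketch, with std::uint32_t standing in for DWORD and hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: joins the high and low 32-bit halves of a file size.
// std::uint32_t stands in for the Windows DWORD type.
std::uint64_t CombineFileSize(std::uint32_t sizeHigh, std::uint32_t sizeLow)
{
    // Widen before shifting: shifting a 32-bit value by 32 bits is undefined.
    // The shift count is 32 (bits), not sizeof(DWORD), which is 4 (bytes).
    return (static_cast<std::uint64_t>(sizeHigh) << 32) | sizeLow;
}

// Quick self-check of the helper.
void CheckCombineFileSize()
{
    assert(CombineFileSize(0, 5) == 5);
    assert(CombineFileSize(1, 0) == 4294967296ULL);   // exactly 2^32 bytes
}
```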

  14. I got similar results with an invalid file handle (your file path, on my machine):
    Runtime with open/close: 2044 ms
    Runtime with stat: 4134 ms
    Runtime with GetFileSize: 63 ms //what handle are you passing here?

    What are you measuring if you pass an invalid file handle? 🙂
    Next time, I recommend using QueryPerformanceCounter() together with QueryPerformanceFrequency() instead of GetTickCount().
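For finer-grained timing than GetTickCount(), a portable sketch can use std::chrono::steady_clock. It is used here as a stand-in for the Windows high-resolution counter, not as the method anyone in this thread actually ran; the helper name TimeMicroseconds and its template shape are assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls `work` `iterations` times and returns the
// elapsed time in microseconds, measured with a monotonic clock.
template <typename Work>
long long TimeMicroseconds(int iterations, Work work)
{
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Each method under test would be wrapped in a lambda and passed as `work`, so that all variants are measured by the same loop.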


  15. ESul

    @Dan: No offense, but that code does not work as advertised. I tried to duplicate your results, or rather to do the test better (small changes to the code were required). Here are my results:

    With possible cache interference (no restart, multiple runs before this actual test):
    Runtime with GetFileSize: 359 ms check: 1231369886
    Runtime with open/close: 374 ms check: 1231369886
    Runtime with stat: 203 ms check: 1231369886
    Runtime with FindFirst: 141 ms check: 1231369886

    After restart:
    Runtime with GetFileSize: 1950 ms check: 1231369886
    Runtime with open/close: 1919 ms check: 1231369886
    Runtime with stat: 234 ms check: 1231369886
    Runtime with FindFirst: 125 ms check: 1231369886

    Files: DIR C:\winblows\system32 /A-D /B > C:\filelist.txt
    File count: 2970
    Check==Total size


  16. Dan

    @ESul: Nice to see someone who actually likes numbers before assuming something is fast or not.

    Sorry about the GetFileSize issues (first argument is the file handle, not the name, duh, and missed the cast).

    My test was a warm cache test (that’s why I do 100000 calls on the same file).

    Here are the warm cache results on my box after the fixes:
    Runtime with open/close: 2637 ms
    Runtime with stat: 3011 ms
    Runtime with GetFileSize: 2230 ms
    Runtime 2328 (1086374927) files with open/close: 110 ms
    Runtime 2328 (1086374927) files with stat: 78 ms
    Runtime 2328 (1086374927) files with GetFileSize: 109 ms

    The many-files runtime comes from a FindFirstFile/FindNextFile loop over 2328 files (like your test).

    By the way, the runtime will differ between versions of the CRT (VS2005, VS2008, VS2010, mingw) for stat, fopen, etc., and also between versions of the OS.

    But thanks again for your numbers.
    Just for reference: my box is an Intel Core2Quad running Win Server 2008 on an Intel 80 GB SSD on SATA2, and I’m using VS2008 to build the tests (with /O2, multi-byte).

    Cheers,
    Dan


  17. ESul

    @Dan: I always like to see numbers!

    I only ran the test in Debug, so I plan to check the results in Release (but without reboots, thank you for your kind understanding) and post back if they differ, but I don’t really expect to see a big difference on I/O performance.

    On the other hand, that SSD thingie probably plays its own part in your results! 😛


  18. AndreiDF

    s/GetFilesSizeFastest/GetFileSizeFastest/

    Now suppose that I create a program that asks for a path and returns the file size, and the client enters a path containing a wildcard character (*, ?). Will your function work OK? Does it return -1, saying that the file does not exist?

  19. @AndreiDF: As far as I know, Microsoft doesn’t offer a direct function for testing file name validity.
    So, you have two options:
    1. Do it on your own. Here you have a reference of invalid chars.
    You anticipated me a little. In my free time I’m working on an application that lets the user define a file name prefix, and this needs a validity test for which I intend to use a regex (of some kind). You can do it without a regex, but that is not professional.
    This function could be the subject of another FAQ, too. Nothing stops you from calling this test before this function, or even inside it. You have to do it even if you are using fopen()+fclose() or GetFileSize().
    2. Replace invalid chars before you get the file path.

    Fortunately, the first function works fine even in that case. So _tstat64() has an important advantage, because it checks the validity of the input string.
    For the second function, until I add a regex FAQ on how to test for an invalid file name, I’m adding a dedicated version for unconventional people.
    long long GetFilesSizeFastest(const TCHAR* szFilePath)
    {
        if (!szFilePath || !szFilePath[0])
            return -1;
        if (_tcsstr(szFilePath, _T("*")) || _tcsstr(szFilePath, _T("?")))
            return -1148479716854183611LL; // for conventional people, replace with "return -1" 😉

        WIN32_FIND_DATA sFileData;
        HANDLE hFind = FindFirstFile(szFilePath, &sFileData);
        if (hFind == INVALID_HANDLE_VALUE)
            return -1;

        FindClose(hFind);

        return (sFileData.nFileSizeHigh * (MAXDWORD+1LL)) + sFileData.nFileSizeLow;
    }

    Especially for you, I have to repeat an important remark: “the help files, books or FAQs are not perfect. It doesn’t fit everywhere and are not designed for simple copy/paste without using any neurons”. I’m sure, you don’t like blind copy+paste. 😛


  20. AndreiDF

    You should state in RED that the code you wrote won’t work. From the text you wrote, I understand that there is a function: I write it, I call it, and everything is fine.

    You shouldn’t present code if it is just a bunch of untested suppositions. You should “use your neurons” and come up with a function that does what it says it does. You are not presenting a topic by ignoring unimportant stuff. You are saying: “Finally, call it wherever you need.”, not: “this function will not work in some cases, here are the cases, here are some solutions etc.”.

    Did I just hear a solution involving regex? There is a saying: «Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.»

    What the hell is this constant supposed to mean?
    return -1148479716854183611LL; // for conventional people, replace with "return -1"

    Why do you use _tcsstr instead of _tcschr?

    Do you want to become like Herbert Schildt? [1], [2], [3] His books are known to be easy to understand, but he presents false, distorted information. As you can see, especially from [2], he wrote a lot of books, but they are not recommended. No serious reviewer said: readers should use their neurons.

    It is very easy to say “are not designed for simple copy/paste without using any neurons”, “Nobody stops you …” to do X and Y. If you want to help someone, post some good information, not buggy code. Even better, don’t post code if you don’t trust what you wrote. Just say: consider using _tstat() for finding the size. If you’re posting code and you’re saying that it’s fast, prove it.

    Why do you think Windows does not provide a fast way to find the file size?

    And BTW, the size you get from stat() is not always up to date. Sometimes you have to open a file and move to the end in order for Windows to update that info. I’ve seen this on big log files that were written constantly but not closed for a long time.

    Especially for you, I have to say: there is a difference between perfection and good with minor mistakes.

    Although I wrote s/GetFilesSizeFastest/GetFileSizeFastest/, you still haven’t fixed that function name.

    [1] http://en.wikipedia.org/wiki/Herbert_Schildt#Reception
    [2] http://www.accu.informika.ru/accu/bookreviews/public/reviews/cp/cp000413.htm
    [3] http://www.lysator.liu.se/c/schildt.html
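The open-and-seek approach mentioned in this comment can be sketched portably with a standard stream. FileSizeBySeek is a hypothetical name; the sketch shows the technique only and makes no claim about when Windows refreshes the size that stat() reports.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>

// Hypothetical helper: opens the file, positions at its end, and reads the
// stream position as the size. Returns -1 if the file cannot be opened.
std::int64_t FileSizeBySeek(const char* path)
{
    assert(path != nullptr);          // a null path is a caller bug

    // std::ios::ate seeks to the end of the file right after opening it.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file.is_open())
        return -1;

    const std::streamoff end = file.tellg();   // -1 if the position is unavailable
    return static_cast<std::int64_t>(end);
}
```

Unlike a stat-based query, this pays for an open and a close on every call, which is the cost the article set out to avoid.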

  21. Andrei, thank you for your feedback. The article will be completed with a validity test once I have written that regex-based function.
    Why do you say that using regular expressions is a second problem?
    That constant is just a dedication…
    Indeed, _tcschr() is more correct than _tcsstr(), but in my case I get the same results.
    No, I don’t want to write any book. I’ll let you do that. You have deeper experience. 😛
    Sorry, but if somebody passes a stupid input, I’m not the only guilty one. As I said, I’ll try to help them.
    Unfortunately, Microsoft doesn’t care enough about this kind of test to provide a WinAPI function that returns TRUE or FALSE.

    Of course Windows provides a fast way to find the file size, but it is not the function one would expect (GetFileSize()).
    Initially, the FAQ was focused on C/C++ only, not on the WinAPI. Some benchmarks persuaded me to introduce a version that uses the WinAPI.
    I don’t know the deep details of stat() because I’m not a low-level specialist. I’m glad to hear that from you. Thank you.
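Checking a path for wildcard characters needs only a single-character search. A sketch using std::strpbrk, which looks for either character in one pass; HasWildcard is a hypothetical name, and plain char stands in for TCHAR (in a TCHAR build, _tcspbrk should be the matching routine).

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true if `path` contains '*' or '?', the two
// wildcard characters that FindFirstFile() would expand.
bool HasWildcard(const char* path)
{
    assert(path != nullptr);          // a null path is a caller bug
    return std::strpbrk(path, "*?") != nullptr;
}
```

Rejecting such paths up front keeps a FindFirstFile-based size query from silently returning the size of the first matching file.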


  22. AndreiDF

    RegEx is slow. Better take a look at stat64.c in crt\src.
    Silviu: Thank you for the idea. It’s good to know.
    “Same results” does not imply “fastest”.
    Silviu: You got the point. 🙂

    The point was not about writing a book; it was about producing large quantities of low-quality material.
    Silviu: You asked about writing a book. Not me.

    “Sorry, if somebody passes a stupid input I’m not the only guilty” It could be the user, not the developer. Is it OK for you to return an _incorrect_ answer because someone mistyped a character?
    Silviu: Which ordinary user uses this article’s information?
    No, it is not right to return an incorrect result in that situation… And now we’re back to the regex discussion.
    🙂

  23. Reviewing the comments, I see a misunderstanding (I missed it, too).
    The title means “file size fast detection”, not “the fastest file size detection“. There is a big difference between these ideas.
    The initial idea of this FAQ/article was to use stat(). I started from the inconvenience of calling additional functions in the other ways of finding the file size (fopen()+fclose(), CreateFile()+CloseHandle(), etc.). I wanted a way of passing a file path and getting the size.
    Fortunately, your feedback helped me to improve the article’s quality and to satisfy the expectations of “fastest”, too.
    Thanks to everyone. Special regards to ESul and Andrei. 🙂
