How to handle '.data' files?

Hi! Can anyone explain how to handle this data format so that I can extract information from it using pandas?

I found this data set at the following link: UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set

It might depend on what the file contains (and the format it’s in). But read_csv should work here.

Based on the file names you shared above, it seems the data and the column names are stored separately. So when using read_csv, you might have to set the header argument appropriately (for example, header=None if the file has no header row). You can check out the read_csv documentation for more details.

I did a quick test with read_csv and was able to at least load the file without issue.
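For anyone following along, here is a minimal sketch of what that kind of load might look like. The rows and the column names below are made up for illustration (the real names would come from the dataset's separate description file), and an in-memory buffer stands in for the actual .data file:

```python
import io

import pandas as pd

# Made-up stand-in for a headerless, comma-separated .data file.
sample = io.StringIO(
    "1001,M,17.99,10.38\n"
    "1002,B,13.54,14.36\n"
)

# Hypothetical column names; the real ones are documented separately from the data.
columns = ["id", "diagnosis", "feature_1", "feature_2"]

# header=None tells pandas the first row is data, not column names.
df = pd.read_csv(sample, header=None, names=columns)
print(df.head())
```

With a real file you would pass its path instead of the buffer, e.g. pd.read_csv("your_file.data", header=None, names=columns), where the path is whatever you saved the download as.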


Hi, thanks for answering! Just one more question: I thought all CSV files ended with ‘.csv’. Is that wrong? If so, how do I identify the file type?

File extensions can be anything you want. Whether something is a CSV depends on whether the contents are separated by commas, not on what extension it is saved with. Extensions are just a convenient way to tell your computer which tool you want to open a file with. For example, .py, .h, and .rb are associated with different languages and get different treatment from IDEs; .doc and .docx will prompt Microsoft Word; .pdf prompts Adobe Reader.
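To make that concrete, here is a rough sketch that guesses the delimiter from a file's contents rather than its name, using csv.Sniffer from Python's standard library. The sniff_delimiter helper and the sample rows are made up for illustration, and the sniffer is only a heuristic, so treat its answer as a hint:

```python
import csv
import os
import tempfile


def sniff_delimiter(path, sample_size=4096):
    """Guess the delimiter from the file's contents, ignoring its extension."""
    with open(path, newline="") as f:
        sample = f.read(sample_size)
    # Only consider a few common delimiters; csv.Error is raised if none fits.
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter


# Made-up rows saved with a .data extension, purely for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".data", delete=False) as tmp:
    tmp.write("1001,M,17.99,10.38\n1002,B,13.54,14.36\n1003,M,20.57,17.77\n")
    path = tmp.name

detected = sniff_delimiter(path)
os.remove(path)  # clean up the temporary file
print(repr(detected))
```

The file is named something.data, but the contents are what decide how it should be parsed.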

But I may be wrong. I have a sense that the format means more than just “which application opens it”, such as how or whether the data is compressed.


Ah, now I see it. I guess I was misled by the usual ‘.csv’ hahah. Thank you for your explanation!

Yes, this was one of those mind-blowing moments for me regarding general computer knowledge :exploding_head: . I am amazed at people who work with binary data or hex editors to manipulate images/videos.

As @hanqi mentioned, it varies from file to file and depends on the encoding/encryption used, if any. From a cybersecurity student’s point of view, one way to determine the file type is to look at the magic bytes in the file header: for example, 4D 5A for a .exe file and FF D8 FF E0 for a .jpeg. .csv files do not have a magic number, but can be categorized as either plain text or an octet stream, as described here. Here is also another writeup on .data files, which may contain CSV, XLS, or XML content.
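As a small illustration of the magic-bytes idea, the sketch below checks the start of a byte string against the two signatures mentioned above. The MAGIC_NUMBERS table and the identify function are hypothetical names of my own, and the table is nowhere near complete:

```python
# Only the two signatures mentioned above; real tools know many more.
MAGIC_NUMBERS = {
    b"\x4D\x5A": "Windows executable (.exe)",
    b"\xFF\xD8\xFF\xE0": "JPEG image (.jpeg)",
}


def identify(header: bytes) -> str:
    """Return a label if the bytes start with a known magic number."""
    for magic, label in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return label
    # Plain-text formats such as CSV have no magic number to match.
    return "unknown (no magic number matched)"


print(identify(b"MZ\x90\x00"))
print(identify(b"\xFF\xD8\xFF\xE0\x00\x10JFIF"))
print(identify(b"1001,M,17.99,10.38"))
```

For a real file, you would read the first few bytes in binary mode, e.g. open(path, "rb").read(8), and pass them to identify.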

You can read more about magic bytes in the context of file carving and forensic work. Then again, file headers can be changed to conceal data (which can be detected and reversed easily), or bit rotations can be applied in more sophisticated attacks to prevent forensic software and file carvers from detecting the evidence.

Here is a website with the most common magic bytes for various file formats.

Quite a good explanation of file formats here:

Happy to share more knowledge in this aspect if you have further questions.


I loved this video! Thank you for sharing all this information, it was really interesting!
