Adam Pridgen, Triaging the Hard Disk: Classifying Specific Text Files

Digital forensics and electronic discovery (eDiscovery) are areas that will require significant attention and care when it comes to scaling analysis and tasks such as evidence collection. These scaling problems arise because storage devices can range from a single gigabyte up to multiple terabytes of storage. In general, analysts who perform eDiscovery tasks rely on tools that scan and account for every byte of data on these disks. However, there are no known tools that allow analysts to scan through a file system and "triage" the disk in an effort to get a basic understanding of its content. This project is a stepping stone in that direction. The goal is to take a large set of files, read random bytes of data from each file, and attempt to classify that data as belonging to a particular type of file. This talk will discuss how the project has been executed, touching on the data, feature extraction, the application of learning algorithms, and the identification of the most robust results.
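
The sampling and feature-extraction steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the abstract does not say which features or learning algorithms were used, so the byte-frequency histogram, the 512-byte sample size, and the function names `sample_bytes` and `byte_histogram` are all hypothetical choices made for this example.

```python
import os
import random
from collections import Counter


def sample_bytes(path, sample_size=512, rng=None):
    """Read up to sample_size bytes starting at a random offset in the file.

    sample_size is an assumed value; the abstract does not specify one.
    """
    rng = rng or random.Random()
    file_size = os.path.getsize(path)
    if file_size == 0:
        return b""
    # Keep the offset low enough that a full sample fits when the file allows it.
    max_offset = max(0, file_size - sample_size)
    offset = rng.randint(0, max_offset)
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(sample_size)


def byte_histogram(sample):
    """Return a 256-element list of relative byte frequencies.

    This is one plausible feature vector, chosen here for illustration only.
    """
    if not sample:
        return [0.0] * 256
    counts = Counter(sample)
    total = len(sample)
    return [counts.get(value, 0) / total for value in range(256)]
```

Each sampled block yields a fixed-length feature vector regardless of file size, so vectors from many files could be passed, together with known file-type labels, to whatever learning algorithm is being evaluated.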