Technically, the fork and join might be 'a' solution, but I get the impression it would be a bad one:
From your description, I still don't see the need to monitor the waiting/processing states of your original classes. As a matter of fact, I find it rather awkward that the user would need to know about, and query, each and every thread/object that may make IO calls!
As I understand it, what you need is one class that monitors all IO calls. The IO class(es) already have all the information needed, so shouldn't that be the place to model this feature? That way the user only needs to keep track of one class and doesn't need to know about any other objects.
Since you mentioned calling system libraries, you may not have the option to modify them for your monitoring purposes, but you can easily solve that by introducing a proxy object for your IO calls. Also create an IO-monitor object that just keeps track of whatever info you need. The proxy can be a temporary object that gets instantiated on each IO call, reports to the monitor, then forwards the IO call to the actual target. Once the call returns, the proxy reports to the monitor again and returns to the caller. Since the 'reports' to the monitor object needn't be synchronized, you can just fork a transition.
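To make the proxy/monitor idea a bit more concrete, here is a minimal Python sketch. All names (IOMonitor, IOProxy, call_started, call_finished, slow_read) are made up for illustration, and slow_read merely stands in for a system library call you can't modify. For simplicity the proxy reports to the monitor with plain method calls; in the model these reports could just as well be forked off asynchronously. The lock only protects the monitor's own bookkeeping when several threads report at once.

```python
import threading


class IOMonitor:
    """Hypothetical monitor: the one object the user queries about IO activity."""

    def __init__(self):
        self._lock = threading.Lock()  # guards the bookkeeping below
        self._next_id = 0
        self.active = {}               # call id -> description of a running IO call
        self.completed = 0             # number of IO calls that have returned

    def call_started(self, description):
        with self._lock:
            call_id = self._next_id
            self._next_id += 1
            self.active[call_id] = description
            return call_id

    def call_finished(self, call_id):
        with self._lock:
            del self.active[call_id]
            self.completed += 1


class IOProxy:
    """Hypothetical short-lived proxy: report, forward the call, report again."""

    def __init__(self, monitor, target):
        self._monitor = monitor
        self._target = target          # the real IO function (assumed callable)

    def call(self, *args, **kwargs):
        name = getattr(self._target, "__name__", repr(self._target))
        call_id = self._monitor.call_started(name)
        try:
            # Forward to the actual target; the caller never sees the monitor.
            return self._target(*args, **kwargs)
        finally:
            # Report completion even if the IO call raised.
            self._monitor.call_finished(call_id)


def slow_read(path):
    """Stand-in for a system library call that cannot be modified."""
    return f"contents of {path}"


monitor = IOMonitor()
result = IOProxy(monitor, slow_read).call("data.txt")
```

The calling objects only ever talk to the proxy, and the user only ever inspects the monitor, so nobody has to know which threads or objects are doing IO.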
One other aspect I'd consider is whether this shouldn't be modeled as activity diagrams rather than state machines. When I talk about state machines, I think of objects that don't really do much apart from changing their state. A window changing from custom size to minimized or maximized would be an example. What you are describing, however, is a system full of objects that do nothing but process, except for the synchronization bit. It's not like these objects are waiting for events to make them switch to any of a potentially large number of different states. Your states are basically defined by having completed a process (which, incidentally, is the way to convert activity diagrams into state machines).
It's not like there's a big difference, but activity diagrams might be better suited to visualize your concept.
Mind you, my experience with multi-threaded system modelling is limited, to say the least. Maybe others have better suggestions on this issue.