Question: Extracting
Forums
I am trying to extract records that have a gap in a time series. I have twelve months worth of data that includes a unique identifier and a month field. The typical unique identifier will have 12 records--one for each of the twelve months. Some unique identifiers have fewer than 12 records because there is not a record for one or more of the months. Also, a few of the records have duplicates for one or more of the months. What I want to do is extract all records for unique identifiers that do not have a record for all 12 months. Does anyone have a suggestion on how to do this?
lelder
My data looks like this:
My data looks like this:
ID Month Payment Group
9654324934 1/1/2018 10.14 Red
9654324934 2/1/2018 17.85 Red
9654324934 3/1/2018 49.50 Red
9654324934 4/1/2018 95.98 Red
9654324934 5/1/2018 42.78 Red
9654324934 6/1/2018 60.88 Red
9654324934 7/1/2018 15.58 Red
9654324934 8/1/2018 71.82 Red
9654324934 9/1/2018 3.62 Red
9654324934 10/1/2018 37.11 Red
9654324934 11/1/2018 78.78 Red
9654324934 12/1/2018 4.71 Red
6922129544 1/1/2018 40.34 Green
6922129544 2/1/2018 25.51 Green
6922129544 3/1/2018 82.02 Green
6922129544 4/1/2018 71.10 Green
6922129544 6/1/2018 64.16 Green
Hi lelder,
Hi lelder,
You can summarize the data Analysis Tab then summarization under the categorize group.
You stated you have a month field but your example shows a date field. You will have to create a month field by appending a numeric or virtual numeric field (Data tab, fields group) prior to summarizing your data.
@month(month) - field name in your example
Now summarize your data by the ID field and total on the month field.
A perfect ID set will show a number of records (field created by the summarization feature) of 12 and a month total of 78.
You would then extract all ID's where the number of records is less than 12 or the number of records is 12 and the month field total is not equal to 78.
No_Of_Recs <12 .or. ( No_of_Recs = 12 .and. month_sum <>78)
This will give you a listing of ID's that are not exactly 12 different months of data.
You now complete a mathces only join of this list back to your original data set.
Primary File: Original Data
Secondary file: Extracted list from the summarization
Match: ID field in both files
Join type: matches only.
You will end up with a new file with records for ID's that either do not contain 12 months of data or have 12 or more records per ID because of duplicate records.
Cheers,
Steve
Steve,
Steve,
This is fantastic! Thanks for your time. Your instructions were great and I successfully produced a file. I wonder if there isn't still an issue with the duplicates. There are IDs that have greater than 12 records and that have a month_sum not equal 78 that could be missing a month. I don't think this scenario is too prevalent in my data though. Thanks again for your help.
lelder
Glad to help.
Glad to help.
One way to deal with the duplicates prior to executing my previous instructions is to summarize the original file first by ID and the month field you created. This will leave you with a file where you will only have one record per id and month, therefore you will never have and ID with greater than 12 records or an ID with 12 records that doesn't contain all 12 months. you can then summarize by just ID and extract No_of_Rec less than 12 then do your matches only join back to the original file.
Hi lelder and welcome to the
Hi lelder and welcome to the site.
Any chance you can post some example data, I am a visual person and seeing it will let me understand it better.
Thanks
Brian