Schema to structure a database with unstructured data

Date Name Customer Feedback Escalation
1/2/20 albert felt being scammed no money lost. no further action
2/2/20 brian lost money customer support escalate to supervisor. day 2 - supervisor reviewed the call. day 3 - give voucher
3/2/20 Charlie felt account compromised customer support lock the id. ask to migrate out the account and then asked charlie to recreate

I have a table like the one above where i keep all the data in an Excel file. Of course this is not sustainable and i want to ask what’s the best way to structure the data to analyze? As the data is not very structured as one important column is the Escalation. Is it something that MySQL can help or is it too unstructured that i need to learn NoSQL database for this purpose?

1 Like

hey @willx

A very delayed response.

What kind of analysis were/are you thinking for the Escalation column? Something that can be done in excel as well but you would rather do it in a database and make data storage more secure and efficient.

Or some higher level of data analytics.

Apologies for my sheepish questions.
Being a student at DQ just trying to grasp the fundamentals and functional aspects of data and its related questions.

oh, that’s great as i also want to test my problem statement.
How i get my data is from email where users would key in the free text. Someone would then copy and paste all the details in to the Escalation column. So the Escalation column is actually an unstructured column where the inputs are

  1. what happens here. what is the problem about. e.g what is the Modus Operandi
  2. what happens after
  3. who fixed the problem etc

does that make sense?

or should i break out the escalation column to a few more columns to structure it further like for Forums to have parent and child and subchild thread. not sure how i going to do that, but want more guidance as to which is the direction i should go for before i sink in further

Hey @willx

Heads up! this is not a solution and I don’t know if this post would make any sense to you.

While writing the first response, I had a question (directed to you) similar to the breakage of the escalation column.

While typing the question I realized:

  1. the segregation of escalation column will have to be based upon the assumption that the escalation is limited in terms of steps, as in something like this:

For this example, I have not included columns for Level 2, and this is what struck me. You are the better judge of your data, but still, we can’t anticipate the levels of escalation as definite steps. So this may cause some trouble for deciding the structure of the table.

  1. Another problem that will arise due to the dissimilarity or completely different pattern of inputs. Like I have taken the liberty here to still have a formatted response. However, the three lines you have given in your earlier post itself are so different.

Now, this second cleaning I guess is where pattern search, identifying common patterns, writing an algorithm to do our job, and those big words come in. So a really cool SQL will be required here to segregate this column into additional columns or a different table for these columns joined with the original table using some primary-foreign key.

Even if we use a DB as storage space and then manipulate the data using say Python (pulling data into data frame etc.), we would still need a very strong code to refine this structure.

Please ignore if you have already come across these (and similar) posts:

As of now, I am highlighting this query in the moderator’s lounge. Hopefully, you might get a “structured” response from them perhaps :grin:

Oh and I have no prior experience with NoSQL, so I will have to look that up :stuck_out_tongue:

2 Likes

oh nice, let me try that out :slight_smile:

Hi @willx

If we talk about data storage, there is no difference in what type of database you choose SQL or NoSQL.

Column escolania is raw text. And for any further processing, you need to first create the logic for parsing the text into some kind of data structure.

So if we are talking exclusively about storage you can use both SQL and NoSQL to store your data in the same format as now. I would also recommend adding a column with a unique event number.
Let it be for example this will be an event registration table.

But after that, you will have to create logic for processing text in a different data structure in both NoSQL and SQL, and this will affect which structure you choose.
I can give you 2 examples. using SQL and Mongo

If you select SQL

You need to create a new table that will have a column associated with the event id + a flat table structure that you converted your text to.

For example, these can be columns.
Employees involved
Action is required
Actions were taken
Status

But you should understand that you will need to develop a structure that gives the maximum possible information and, accordingly, you must somehow process all the texts to fill in this structure.

If you choose Mongo
Then your structure can be multi-level in JSON format (if you are not familiar with json, think of it as a structure of nested lists and dictionaries)

Then for example you can have a structure
{General status: ‘Active’,
Employees involved: [{name:", action: ", action status:},
{name: ", action: ", action status:}]}

Each document can have any depth and shape of this structure.

Therefore, I believe that you should start from how you see the result of processing your data and interaction with them. SQL databases like postgres also support storing JSON in columns but work with them much lower.

Sometimes when choosing a database, you can start from a trivial rule.

If you can present your data in a flat table structure with a rigid hierarchy, then choose SQL. If you have no idea what to expect from the data and what final structure you will arrive at then start with NoSQL

2 Likes

oh that’s a good explanation @moriturus7.

What if for SQL, i have a column where it is a narrative of sequence of events. e.g.
i want to find out general trends / modus operandi in that column of “Escalation”. i cannot really structure it.
the data is really a Copy+Paste from email threads to capture the narrative of what happened. Can i just take it, then run some stopwords to take it out?

Hi @willx

It all depends on how you plan to work with data in the future.

If you just need to use keyword search to filter data, you can use LIKE

For example

SELECT
       *
FROM
	data_table
WHERE
	escalation LIKE '%support%';

If you have a large table and will use PostgreSQL you can use trigram indexing to speed up your search.

But if possible, I would recommend that you make notes about what information you are ultimately trying to get for yourself from this column. Perhaps over time, this will allow you to come up with a table with columns in which you can decompose this data, which will make it easier to work with them in the future. And it will only come down to writing code that will convert the text of the message chain into the necessary data.

Often it is the routine actions that we perform that are the Foundation for simplifying the processing of any data.